An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l

This paper is a preprint and has not been certified by peer review.

Authors

James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak

Abstract

We provide concrete evidence for memory management in a 4-layer transformer. Specifically, we identify clean-up behavior, in which model components consistently remove the output of preceding components during a forward pass. Our findings suggest that the interpretability technique Direct Logit Attribution can provide misleading results. We show explicit examples where this technique is inaccurate, as it does not account for clean-up behavior.
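To make the failure mode concrete, here is a minimal NumPy sketch (not the authors' code; all quantities are toy values) of why Direct Logit Attribution can mislead in the presence of clean-up. DLA credits each component with the dot product of its own output and the unembedding direction, so a component whose output is largely removed by a later clean-up component still receives full credit:

```python
import numpy as np

# Toy setup: the residual stream is the sum of component outputs, and a
# token's logit is the dot product of the stream with an unembedding direction.
rng = np.random.default_rng(0)
d_model = 8
unembed = rng.normal(size=d_model)      # logit direction for one token

head_out = rng.normal(size=d_model)     # an attention head's output
cleanup_out = -0.9 * head_out           # a later component removes ~90% of it

# Direct Logit Attribution: credit each component independently.
dla_head = head_out @ unembed
dla_cleanup = cleanup_out @ unembed

# The head's attributed contribution is large, but after clean-up only
# 10% of it actually reaches the logits.
net_contribution = (head_out + cleanup_out) @ unembed
assert np.isclose(net_contribution, 0.1 * dla_head)
```

Because the clean-up term cancels most of the head's output before the unembedding, the head's standalone DLA score overstates its true effect on the logits by a factor of ten in this toy example.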
