An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l
James Dao, Yeu-Tong Lao, Can Rager, Jett Janiak
Abstract

We provide concrete evidence for memory management in a 4-layer transformer. Specifically, we identify clean-up behavior, in which model components consistently remove the output of preceding components during a forward pass. Our findings suggest that the interpretability technique Direct Logit Attribution provides misleading results. We show explicit examples where this technique is inaccurate, as it does not account for clean-up behavior.
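For context, the sketch below shows how Direct Logit Attribution is commonly computed with the TransformerLens library on the gelu-4l checkpoint: the final residual stream is decomposed into per-component contributions, and each contribution is projected onto the unembedding direction of a target token. The prompt and the choice of attributing the model's top predicted token are illustrative assumptions, not taken from the paper.

```python
from transformer_lens import HookedTransformer

# Load the 4-layer GELU model (checkpoint name as registered in TransformerLens).
model = HookedTransformer.from_pretrained("gelu-4l")

prompt = "The capital of France is"  # illustrative prompt, not from the paper
tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)

# Attribute the logit of the model's top next-token prediction at the final position.
token_id = logits[0, -1].argmax().item()
direction = model.W_U[:, token_id]

# Decompose the final residual stream into per-component contributions
# (embeddings, each attention layer's output, each MLP layer's output).
resid_stack, labels = cache.decompose_resid(layer=None, return_labels=True)

# Apply the final LayerNorm scaling, then project each component's output onto
# the unembedding direction; the dot product is that component's direct logit attribution.
resid_stack = cache.apply_ln_to_stack(resid_stack, layer=-1)
dla = resid_stack[:, 0, -1, :] @ direction

for label, value in zip(labels, dla.tolist()):
    print(f"{label:>12}: {value:+.3f}")
```

Because each contribution is read off in isolation, a component whose output is later erased by a clean-up component still appears to contribute directly to the logit, which is the failure mode this paper documents.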