An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l
James Dao, Yeu-Tong Lao, Can Rager, Jett Janiak
Abstract

We provide concrete evidence for memory management in a 4-layer transformer. Specifically, we identify clean-up behavior, in which model components consistently remove the output of preceding components during a forward pass. Our findings suggest that the interpretability technique Direct Logit Attribution provides misleading results. We show explicit examples where this technique is inaccurate, as it does not account for clean-up behavior.
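For context, the sketch below shows how Direct Logit Attribution is commonly computed with the TransformerLens library on the gelu-4l checkpoint: the final residual stream is decomposed into per-component contributions, and each contribution is projected onto the unembedding direction of a target token. The prompt and the choice of attributing the model's top predicted token are illustrative assumptions, not taken from the paper.

```python
from transformer_lens import HookedTransformer

# Load the 4-layer GELU model (checkpoint name as registered in TransformerLens).
model = HookedTransformer.from_pretrained("gelu-4l")

prompt = "The capital of France is"  # illustrative prompt, not from the paper
tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)

# Attribute the logit of the model's top next-token prediction at the final position.
token_id = logits[0, -1].argmax().item()
direction = model.W_U[:, token_id]

# Decompose the final residual stream into per-component contributions
# (embeddings, each attention layer's output, each MLP layer's output).
resid_stack, labels = cache.decompose_resid(layer=None, return_labels=True)

# Apply the final LayerNorm scaling, then project each component's output onto
# the unembedding direction; the dot product is that component's direct logit attribution.
resid_stack = cache.apply_ln_to_stack(resid_stack, layer=-1)
dla = resid_stack[:, 0, -1, :] @ direction

for label, value in zip(labels, dla.tolist()):
    print(f"{label:>12}: {value:+.3f}")
```

Because each contribution is read off in isolation, a component whose output is later erased by a clean-up component still appears to contribute directly to the logit, which is the failure mode this paper documents.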