An Adversarial Example for Direct Logit Attribution: Memory Management
in gelu-4l
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023
Abstract
We provide concrete evidence for memory management in a 4-layer transformer. Specifically, we identify clean-up behavior, in which model components consistently remove the output of preceeding components during a forward pass. Our findings suggest that the interpretability technique Direct Logit Attribution provides misleading results. We show explicit examples where this technique is inaccurate, as it does not account for clean-up behavior.
View on arXivComments on this paper
