Backward Lens: Projecting Language Model Gradients into the Vocabulary Space

20 February 2024

Papers citing "Backward Lens: Projecting Language Model Gradients into the Vocabulary Space"


A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao
62 18 0
02 Jul 2024

Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva, Jasmijn Bastings, Katja Filippova, Amir Globerson
KELM
189 260 0
28 Apr 2023

Toy Models of Superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah
AAML MILM
120 314 0
21 Sep 2022