Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

17 May 2024

Jake Mendel

Stefan Heimersheim

Nicholas Goldowsky-Dill

Kaarel Hänni

Marius Hobbhahn

Papers citing "Using Degeneracy in the Loss Landscape for Mechanistic Interpretability"

Title | Authors | Tag | Counts | Date
Review and Prospect of Algebraic Research in Equivalent Framework between Statistical Mechanics and Machine Learning Theory | Sumio Watanabe | | 25, 1, 0 | 31 May 2024
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks | Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, Marius Hobbhahn | FAtt | 28, 3, 0 | 17 May 2024
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt | | 210, 491, 0 | 01 Nov 2022