Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

17 May 2024

Jake Mendel

Stefan Heimersheim

Nicholas Goldowsky-Dill

Kaarel Hänni

Marius Hobbhahn

Papers citing "Using Degeneracy in the Loss Landscape for Mechanistic Interpretability"

Title | Authors | Counts | Date
Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition | Brianna Chrisman, Lucius Bushnaq, Lee D. Sharkey | 39, 0, 0 | 31 Mar 2025
NGD converges to less degenerate solutions than SGD | Moosa Saghir, N. R. Raghavendra, Zihe Liu, Evan Ryan Gunter | 25, 0, 0 | 07 Sep 2024
Cluster-norm for Unsupervised Probing of Knowledge | Walter Laurito, Sharan Maiya, Grégoire Dhimoïla, Owen Owen Yeung, Kaarel Hänni | 27, 2, 0 | 26 Jul 2024
Weight-based Decomposition: A Case for Bilinear MLPs | Michael T. Pearce, Thomas Dooms, Alice Rigg | 42, 1, 0 | 06 Jun 2024
Review and Prospect of Algebraic Research in Equivalent Framework between Statistical Mechanics and Machine Learning Theory | Sumio Watanabe | 25, 1, 0 | 31 May 2024
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks | Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, Marius Hobbhahn | 33, 3, 0 | 17 May 2024
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt | 210, 491, 0 | 01 Nov 2022