CausalGym: Benchmarking causal interpretability methods on linguistic tasks

19 February 2024

Papers citing "CausalGym: Benchmarking causal interpretability methods on linguistic tasks"

Title
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages | Jannik Brinkmann, Chris Wendler, Christian Bartelt, Aaron Mueller | - | 35 9 0 | 10 Jan 2025
Language models align with human judgments on key grammatical constructions | Jennifer Hu, Kyle Mahowald, G. Lupyan, Anna A. Ivanova, Roger Levy | - | 22 10 0 | 19 Jan 2024
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | Samuel Marks, Max Tegmark | HILM | 83 164 0 | 10 Oct 2023
A Geometric Notion of Causal Probing | Clément Guerner, Anej Svete, Tianyu Liu, Alex Warstadt, Ryan Cotterell | LLMSV | 24 12 0 | 27 Jul 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations | Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman | CML | 73 98 0 | 05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt | - | 205 486 0 | 01 Nov 2022
Naturalistic Causal Probing for Morpho-Syntax | Afra Amini, Tiago Pimentel, Clara Meister, Ryan Cotterell | MILM | 93 13 0 | 14 May 2022
Probing Classifiers: Promises, Shortcomings, and Advances | Yonatan Belinkov | - | 216 291 0 | 24 Feb 2021