Opening the AI black box: program synthesis via mechanistic interpretability
arXiv:2402.05110 · 7 February 2024
Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelić, Max Tegmark
Papers citing "Opening the AI black box: program synthesis via mechanistic interpretability" (14 of 14 papers shown)
1. Towards Understanding Distilled Reasoning Models: A Representational Approach — David D. Baek, Max Tegmark — 05 Mar 2025
2. Harmonic Loss Trains Interpretable AI Models — David D. Baek, Ziming Liu, Riya Tyagi, Max Tegmark — 03 Feb 2025
3. Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning — David D. Baek, Yuxiao Li, Max Tegmark — 10 Oct 2024
4. TracrBench: Generating Interpretability Testbeds with Large Language Models — Hannes Thurnherr, Jérémy Scheurer — 07 Sep 2024
5. Weight-based Decomposition: A Case for Bilinear MLPs — Michael T. Pearce, Thomas Dooms, Alice Rigg — 06 Jun 2024
6. Meta-Designing Quantum Experiments with Language Models — Sören Arlt, Haonan Duan, Felix Li, Sang Michael Xie, Yuhuai Wu, Mario Krenn — 04 Jun 2024
7. Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems — David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart J. Russell, Max Tegmark, ..., Clark Barrett, Ding Zhao, Zhi-Xuan Tan, Jeannette Wing, Joshua Tenenbaum — 10 May 2024
8. Mechanistic Interpretability for AI Safety -- A Review — Leonard Bereska, E. Gavves — 22 Apr 2024
9. Rethinking the Relationship between Recurrent and Non-Recurrent Neural Networks: A Study in Sparsity — Quincy Hershey, Randy Paffenroth, Harsh Nilesh Pathak, Simon Tavener — 01 Apr 2024
10. Attribution Patching Outperforms Automated Circuit Discovery — Aaquib Syed, Can Rager, Arthur Conmy — 16 Oct 2023
11. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets — Samuel Marks, Max Tegmark — 10 Oct 2023
12. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model — Michael Hanna, Ollie Liu, Alexandre Variengien — 30 Apr 2023
13. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small — Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt — 01 Nov 2022
14. In-context Learning and Induction Heads — Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah — 24 Sep 2022