Opening the AI black box: program synthesis via mechanistic interpretability
arXiv:2402.05110 · 7 February 2024
Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelić, Max Tegmark
Papers citing "Opening the AI black box: program synthesis via mechanistic interpretability" (14 of 14 papers shown)
1. Towards Understanding Distilled Reasoning Models: A Representational Approach — David D. Baek, Max Tegmark — 05 Mar 2025
2. Harmonic Loss Trains Interpretable AI Models — David D. Baek, Ziming Liu, Riya Tyagi, Max Tegmark — 03 Feb 2025
3. Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning — David D. Baek, Yuxiao Li, Max Tegmark — 10 Oct 2024
4. TracrBench: Generating Interpretability Testbeds with Large Language Models — Hannes Thurnherr, Jérémy Scheurer — 07 Sep 2024
5. Weight-based Decomposition: A Case for Bilinear MLPs — Michael T. Pearce, Thomas Dooms, Alice Rigg — 06 Jun 2024
6. Meta-Designing Quantum Experiments with Language Models — Sören Arlt, Haonan Duan, Felix Li, Sang Michael Xie, Yuhuai Wu, Mario Krenn — 04 Jun 2024
7. Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems — David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart J. Russell, Max Tegmark, ..., Clark Barrett, Ding Zhao, Zhi-Xuan Tan, Jeannette Wing, Joshua Tenenbaum — 10 May 2024
8. Mechanistic Interpretability for AI Safety -- A Review — Leonard Bereska, E. Gavves — 22 Apr 2024
9. Rethinking the Relationship between Recurrent and Non-Recurrent Neural Networks: A Study in Sparsity — Quincy Hershey, Randy Paffenroth, Harsh Nilesh Pathak, Simon Tavener — 01 Apr 2024
10. Attribution Patching Outperforms Automated Circuit Discovery — Aaquib Syed, Can Rager, Arthur Conmy — 16 Oct 2023
11. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets — Samuel Marks, Max Tegmark — 10 Oct 2023
12. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model — Michael Hanna, Ollie Liu, Alexandre Variengien — 30 Apr 2023
13. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small — Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt — 01 Nov 2022
14. In-context Learning and Induction Heads — Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah — 24 Sep 2022