Opening the AI black box: program synthesis via mechanistic interpretability

7 February 2024 · arXiv:2402.05110
Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelić, Max Tegmark

Papers citing "Opening the AI black box: program synthesis via mechanistic interpretability"

14 papers shown.

 1. Towards Understanding Distilled Reasoning Models: A Representational Approach
    David D. Baek, Max Tegmark · LRM · 05 Mar 2025

 2. Harmonic Loss Trains Interpretable AI Models
    David D. Baek, Ziming Liu, Riya Tyagi, Max Tegmark · 03 Feb 2025

 3. Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning
    David D. Baek, Yuxiao Li, Max Tegmark · 10 Oct 2024

 4. TracrBench: Generating Interpretability Testbeds with Large Language Models
    Hannes Thurnherr, Jérémy Scheurer · 07 Sep 2024

 5. Weight-based Decomposition: A Case for Bilinear MLPs
    Michael T. Pearce, Thomas Dooms, Alice Rigg · 06 Jun 2024

 6. Meta-Designing Quantum Experiments with Language Models
    Sören Arlt, Haonan Duan, Felix Li, Sang Michael Xie, Yuhuai Wu, Mario Krenn · AI4CE · 04 Jun 2024

 7. Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
    David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart J. Russell, Max Tegmark, ..., Clark Barrett, Ding Zhao, Zhi-Xuan Tan, Jeannette Wing, Joshua Tenenbaum · 10 May 2024

 8. Mechanistic Interpretability for AI Safety -- A Review
    Leonard Bereska, E. Gavves · AI4CE · 22 Apr 2024

 9. Rethinking the Relationship between Recurrent and Non-Recurrent Neural Networks: A Study in Sparsity
    Quincy Hershey, Randy Paffenroth, Harsh Nilesh Pathak, Simon Tavener · 01 Apr 2024

10. Attribution Patching Outperforms Automated Circuit Discovery
    Aaquib Syed, Can Rager, Arthur Conmy · 16 Oct 2023

11. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
    Samuel Marks, Max Tegmark · HILM · 10 Oct 2023

12. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
    Michael Hanna, Ollie Liu, Alexandre Variengien · LRM · 30 Apr 2023

13. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
    Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt · 01 Nov 2022

14. In-context Learning and Induction Heads
    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah · 24 Sep 2022