arXiv: 2112.00826
Cited By
Inducing Causal Structure for Interpretable Neural Networks
1 December 2021
Atticus Geiger
Zhengxuan Wu
Hanson Lu
J. Rozner
Elisa Kreiss
Thomas F. Icard
Noah D. Goodman
Christopher Potts
CML
OOD
Papers citing
"Inducing Causal Structure for Interpretable Neural Networks"
50 / 61 papers shown
Divide (Text) and Conquer (Sentiment): Improved Sentiment Classification by Constituent Conflict Resolution
Jan Kościałkowski
Paweł Marcinkowski
14
0
0
08 May 2025
Inducing Causal Structure for Interpretable Neural Networks Applied to Glucose Prediction for T1DM Patients
Ana Esponera
Giovanni Cinà
BDL
CML
52
0
0
18 Mar 2025
Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?
Maxime Méloux
Silviu Maniu
François Portet
Maxime Peyrard
34
0
0
28 Feb 2025
What is causal about causal models and representations?
Frederik Hytting Jørgensen
Luigi Gresele
S. Weichwald
CML
101
0
0
31 Jan 2025
Inference and Verbalization Functions During In-Context Learning
Junyi Tao
Xiaoyin Chen
Nelson F. Liu
ReLM
LRM
21
0
0
12 Oct 2024
Neural Networks Decoded: Targeted and Robust Analysis of Neural Network Decisions via Causal Explanations and Reasoning
A. Diallo
Vaishak Belle
P. Patras
AAML
11
0
0
07 Oct 2024
OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions
Yu-Shin Huang
Peter Just
Krishna Narayanan
Chao Tian
32
3
0
06 Oct 2024
RNR: Teaching Large Language Models to Follow Roles and Rules
Kuan-Chieh Jackson Wang
Alexander Bukharin
Haoming Jiang
Qingyu Yin
Zhengyang Wang
...
Chao Zhang
Bing Yin
Xian Li
Jianshu Chen
Shiyang Li
ALM
26
1
0
10 Sep 2024
On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs
Nitay Calderon
Roi Reichart
32
10
0
27 Jul 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang
Yunzhi Yao
Ziwen Xu
Shuofei Qiao
Shumin Deng
...
Yong-jia Jiang
Pengjun Xie
Fei Huang
Huajun Chen
Ningyu Zhang
47
27
0
22 Jul 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Rohan Gupta
Iván Arcuschin
Thomas Kwa
Adrià Garriga-Alonso
45
3
0
19 Jul 2024
Graph Neural Network Causal Explanation via Neural Causal Models
Arman Behnam
Binghui Wang
CML
40
3
0
12 Jul 2024
Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Yibo Jiang
Goutham Rajendran
Pradeep Ravikumar
Bryon Aragam
CLL
KELM
29
6
0
26 Jun 2024
Introducing Diminutive Causal Structure into Graph Representation Learning
Hang Gao
Peng Qiao
Yifan Jin
Fengge Wu
Jiangmeng Li
Changwen Zheng
25
4
0
13 Jun 2024
How to use and interpret activation patching
Stefan Heimersheim
Neel Nanda
17
36
0
23 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
38
111
0
22 Apr 2024
Scope Ambiguities in Large Language Models
Gaurav Kamath
Sebastian Schuster
Sowmya Vajjala
Siva Reddy
27
2
0
05 Apr 2024
Locating and Editing Factual Associations in Mamba
Arnab Sen Sharma
David Atkinson
David Bau
KELM
68
28
0
04 Apr 2024
ReFT: Representation Finetuning for Language Models
Zhengxuan Wu
Aryaman Arora
Zheng Wang
Atticus Geiger
Daniel Jurafsky
Christopher D. Manning
Christopher Potts
OffRL
30
58
0
04 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
44
110
0
28 Mar 2024
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
Zhengxuan Wu
Atticus Geiger
Aryaman Arora
Jing-ling Huang
Zheng Wang
Noah D. Goodman
Christopher D. Manning
Christopher Potts
MU
44
25
0
12 Mar 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár
Tom Lieberum
Rohin Shah
Neel Nanda
KELM
43
42
0
01 Mar 2024
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Aryaman Arora
Daniel Jurafsky
Christopher Potts
50
21
0
19 Feb 2024
Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs
Bilal Chughtai
Alan Cooney
Neel Nanda
HILM
KELM
25
16
0
11 Feb 2024
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
Zhengxuan Wu
Atticus Geiger
Jing-ling Huang
Aryaman Arora
Thomas F. Icard
Christopher Potts
Noah D. Goodman
28
6
0
23 Jan 2024
DiConStruct: Causal Concept-based Explanations through Black-Box Distillation
Ricardo Moreira
Jacopo Bono
Mário Cardoso
Pedro Saleiro
Mário A. T. Figueiredo
P. Bizarro
CML
15
4
0
16 Jan 2024
Emergence and Function of Abstract Representations in Self-Supervised Transformers
Quentin RV. Ferry
Joshua Ching
Takashi Kawai
11
2
0
08 Dec 2023
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia
Giovanni Monea
Maxime Peyrard
Martin Josifoski
Vishrav Chaudhary
Jason Eisner
Emre Kiciman
Hamid Palangi
Barun Patra
Robert West
KELM
47
12
0
04 Dec 2023
Flexible Model Interpretability through Natural Language Model Editing
Karel D'Oosterlinck
Thomas Demeester
Chris Develder
Christopher Potts
MILM
KELM
8
0
0
17 Nov 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang
Neel Nanda
LLMSV
26
96
0
27 Sep 2023
Rigorously Assessing Natural Language Explanations of Neurons
Jing-ling Huang
Atticus Geiger
Karel D'Oosterlinck
Zhengxuan Wu
Christopher Potts
MILM
16
25
0
19 Sep 2023
Circuit Breaking: Removing Model Behaviors with Targeted Ablation
Maximilian Li
Xander Davies
Max Nadeau
KELM
MU
14
27
0
12 Sep 2023
The Hydra Effect: Emergent Self-repair in Language Model Computations
Tom McGrath
Matthew Rahtz
János Kramár
Vladimir Mikulik
Shane Legg
MILM
LRM
13
68
0
28 Jul 2023
Discovering Variable Binding Circuitry with Desiderata
Xander Davies
Max Nadeau
Nikhil Prakash
Tamar Rott Shaham
David Bau
21
12
0
07 Jul 2023
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
Zhaofeng Wu
Linlu Qiu
Alexis Ross
Ekin Akyürek
Boyuan Chen
Bailin Wang
Najoung Kim
Jacob Andreas
Yoon Kim
LRM
ReLM
35
192
0
05 Jul 2023
Minimum Levels of Interpretability for Artificial Moral Agents
Avish Vijayaraghavan
C. Badea
AI4CE
25
5
0
02 Jul 2023
LEACE: Perfect linear concept erasure in closed form
Nora Belrose
David Schneider-Joseph
Shauli Ravfogel
Ryan Cotterell
Edward Raff
Stella Biderman
KELM
MU
41
102
0
06 Jun 2023
ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning
Jingyuan Selena She
Christopher Potts
Sam Bowman
Atticus Geiger
8
13
0
30 May 2023
Has It All Been Solved? Open NLP Research Questions Not Solved by Large Language Models
Oana Ignat
Zhijing Jin
Artem Abzaliev
Laura Biester
Santiago Castro
...
Verónica Pérez-Rosas
Siqi Shen
Zekun Wang
Winston Wu
Rada Mihalcea
LRM
24
6
0
21 May 2023
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
Zhengxuan Wu
Atticus Geiger
Thomas Icard
Christopher Potts
Noah D. Goodman
MILM
17
81
0
15 May 2023
Estimating the Causal Effects of Natural Logic Features in Neural NLI Models
Julia Rozanova
Marco Valentino
André Freitas
CML
19
4
0
15 May 2023
Localizing Model Behavior with Path Patching
Nicholas W. Goldowsky-Dill
Chris MacLeod
L. Sato
Aryaman Arora
8
85
0
12 Apr 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger
Zhengxuan Wu
Christopher Potts
Thomas F. Icard
Noah D. Goodman
CML
73
98
0
05 Mar 2023
Competence-Based Analysis of Language Models
Adam Davies
Jize Jiang
Chengxiang Zhai
ELM
21
4
0
01 Mar 2023
Analyzing And Editing Inner Mechanisms Of Backdoored Language Models
Max Lamparth
Anka Reuel
KELM
28
10
0
24 Feb 2023
A Survey of Methods, Challenges and Perspectives in Causality
Gael Gendron
Michael Witbrock
Gillian Dobbie
OOD
AI4CE
CML
12
12
0
01 Feb 2023
Introducing Expertise Logic into Graph Representation Learning from A Causal Perspective
Hang Gao
Jiangmeng Li
Wenwen Qiang
Lingyu Si
Xingzhe Su
Feng Wu
Changwen Zheng
Fuchun Sun
24
0
0
20 Jan 2023
Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training
Jing-ling Huang
Zhengxuan Wu
Kyle Mahowald
Christopher Potts
19
13
0
19 Dec 2022
Explainability Via Causal Self-Talk
Nicholas A. Roy
Junkyung Kim
Neil C. Rabinowitz
CML
6
7
0
17 Nov 2022
Neural Bayesian Network Understudy
Paloma Rabaey
Cedric De Boom
Thomas Demeester
BDL
CML
14
0
0
15 Nov 2022