arXiv: 2112.00826
Cited By
Inducing Causal Structure for Interpretable Neural Networks
1 December 2021
Atticus Geiger
Zhengxuan Wu
Hanson Lu
J. Rozner
Elisa Kreiss
Thomas F. Icard
Noah D. Goodman
Christopher Potts
CML
OOD
Papers citing
"Inducing Causal Structure for Interpretable Neural Networks"
50 / 61 papers shown
Divide (Text) and Conquer (Sentiment): Improved Sentiment Classification by Constituent Conflict Resolution
Jan Kościałkowski
Paweł Marcinkowski
14
0
0
08 May 2025
Inducing Causal Structure for Interpretable Neural Networks Applied to Glucose Prediction for T1DM Patients
Ana Esponera
Giovanni Cinà
BDL
CML
52
0
0
18 Mar 2025
Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?
Maxime Méloux
Silviu Maniu
François Portet
Maxime Peyrard
34
0
0
28 Feb 2025
What is causal about causal models and representations?
Frederik Hytting Jørgensen
Luigi Gresele
S. Weichwald
CML
101
0
0
31 Jan 2025
Inference and Verbalization Functions During In-Context Learning
Junyi Tao
Xiaoyin Chen
Nelson F. Liu
ReLM
LRM
21
0
0
12 Oct 2024
Neural Networks Decoded: Targeted and Robust Analysis of Neural Network Decisions via Causal Explanations and Reasoning
A. Diallo
Vaishak Belle
P. Patras
AAML
11
0
0
07 Oct 2024
OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions
Yu-Shin Huang
Peter Just
Krishna Narayanan
Chao Tian
32
3
0
06 Oct 2024
RNR: Teaching Large Language Models to Follow Roles and Rules
Kuan-Chieh Jackson Wang
Alexander Bukharin
Haoming Jiang
Qingyu Yin
Zhengyang Wang
...
Chao Zhang
Bing Yin
Xian Li
Jianshu Chen
Shiyang Li
ALM
26
1
0
10 Sep 2024
On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs
Nitay Calderon
Roi Reichart
32
10
0
27 Jul 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang
Yunzhi Yao
Ziwen Xu
Shuofei Qiao
Shumin Deng
...
Yong-jia Jiang
Pengjun Xie
Fei Huang
Huajun Chen
Ningyu Zhang
47
27
0
22 Jul 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Rohan Gupta
Iván Arcuschin
Thomas Kwa
Adrià Garriga-Alonso
45
3
0
19 Jul 2024
Graph Neural Network Causal Explanation via Neural Causal Models
Arman Behnam
Binghui Wang
CML
40
3
0
12 Jul 2024
Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Yibo Jiang
Goutham Rajendran
Pradeep Ravikumar
Bryon Aragam
CLL
KELM
29
6
0
26 Jun 2024
Introducing Diminutive Causal Structure into Graph Representation Learning
Hang Gao
Peng Qiao
Yifan Jin
Fengge Wu
Jiangmeng Li
Changwen Zheng
25
4
0
13 Jun 2024
How to use and interpret activation patching
Stefan Heimersheim
Neel Nanda
17
36
0
23 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
38
111
0
22 Apr 2024
Scope Ambiguities in Large Language Models
Gaurav Kamath
Sebastian Schuster
Sowmya Vajjala
Siva Reddy
27
2
0
05 Apr 2024
Locating and Editing Factual Associations in Mamba
Arnab Sen Sharma
David Atkinson
David Bau
KELM
68
28
0
04 Apr 2024
ReFT: Representation Finetuning for Language Models
Zhengxuan Wu
Aryaman Arora
Zheng Wang
Atticus Geiger
Daniel Jurafsky
Christopher D. Manning
Christopher Potts
OffRL
30
58
0
04 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
44
110
0
28 Mar 2024
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
Zhengxuan Wu
Atticus Geiger
Aryaman Arora
Jing-ling Huang
Zheng Wang
Noah D. Goodman
Christopher D. Manning
Christopher Potts
MU
44
25
0
12 Mar 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár
Tom Lieberum
Rohin Shah
Neel Nanda
KELM
43
42
0
01 Mar 2024
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Aryaman Arora
Daniel Jurafsky
Christopher Potts
50
21
0
19 Feb 2024
Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs
Bilal Chughtai
Alan Cooney
Neel Nanda
HILM
KELM
25
16
0
11 Feb 2024
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
Zhengxuan Wu
Atticus Geiger
Jing-ling Huang
Aryaman Arora
Thomas F. Icard
Christopher Potts
Noah D. Goodman
28
6
0
23 Jan 2024
DiConStruct: Causal Concept-based Explanations through Black-Box Distillation
Ricardo Moreira
Jacopo Bono
Mário Cardoso
Pedro Saleiro
Mário A. T. Figueiredo
P. Bizarro
CML
15
4
0
16 Jan 2024
Emergence and Function of Abstract Representations in Self-Supervised Transformers
Quentin RV. Ferry
Joshua Ching
Takashi Kawai
11
2
0
08 Dec 2023
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia
Giovanni Monea
Maxime Peyrard
Martin Josifoski
Vishrav Chaudhary
Jason Eisner
Emre Kiciman
Hamid Palangi
Barun Patra
Robert West
KELM
47
12
0
04 Dec 2023
Flexible Model Interpretability through Natural Language Model Editing
Karel D'Oosterlinck
Thomas Demeester
Chris Develder
Christopher Potts
MILM
KELM
8
0
0
17 Nov 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang
Neel Nanda
LLMSV
26
96
0
27 Sep 2023
Rigorously Assessing Natural Language Explanations of Neurons
Jing-ling Huang
Atticus Geiger
Karel D'Oosterlinck
Zhengxuan Wu
Christopher Potts
MILM
16
25
0
19 Sep 2023
Circuit Breaking: Removing Model Behaviors with Targeted Ablation
Maximilian Li
Xander Davies
Max Nadeau
KELM
MU
14
27
0
12 Sep 2023
The Hydra Effect: Emergent Self-repair in Language Model Computations
Tom McGrath
Matthew Rahtz
János Kramár
Vladimir Mikulik
Shane Legg
MILM
LRM
13
68
0
28 Jul 2023
Discovering Variable Binding Circuitry with Desiderata
Xander Davies
Max Nadeau
Nikhil Prakash
Tamar Rott Shaham
David Bau
21
12
0
07 Jul 2023
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
Zhaofeng Wu
Linlu Qiu
Alexis Ross
Ekin Akyürek
Boyuan Chen
Bailin Wang
Najoung Kim
Jacob Andreas
Yoon Kim
LRM
ReLM
35
192
0
05 Jul 2023
Minimum Levels of Interpretability for Artificial Moral Agents
Avish Vijayaraghavan
C. Badea
AI4CE
25
5
0
02 Jul 2023
LEACE: Perfect linear concept erasure in closed form
Nora Belrose
David Schneider-Joseph
Shauli Ravfogel
Ryan Cotterell
Edward Raff
Stella Biderman
KELM
MU
41
102
0
06 Jun 2023
ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning
Jingyuan Selena She
Christopher Potts
Sam Bowman
Atticus Geiger
8
13
0
30 May 2023
Has It All Been Solved? Open NLP Research Questions Not Solved by Large Language Models
Oana Ignat
Zhijing Jin
Artem Abzaliev
Laura Biester
Santiago Castro
...
Verónica Pérez-Rosas
Siqi Shen
Zekun Wang
Winston Wu
Rada Mihalcea
LRM
24
6
0
21 May 2023
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
Zhengxuan Wu
Atticus Geiger
Thomas Icard
Christopher Potts
Noah D. Goodman
MILM
17
81
0
15 May 2023
Estimating the Causal Effects of Natural Logic Features in Neural NLI Models
Julia Rozanova
Marco Valentino
André Freitas
CML
19
4
0
15 May 2023
Localizing Model Behavior with Path Patching
Nicholas W. Goldowsky-Dill
Chris MacLeod
L. Sato
Aryaman Arora
8
85
0
12 Apr 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger
Zhengxuan Wu
Christopher Potts
Thomas F. Icard
Noah D. Goodman
CML
73
98
0
05 Mar 2023
Competence-Based Analysis of Language Models
Adam Davies
Jize Jiang
Chengxiang Zhai
ELM
21
4
0
01 Mar 2023
Analyzing And Editing Inner Mechanisms Of Backdoored Language Models
Max Lamparth
Anka Reuel
KELM
28
10
0
24 Feb 2023
A Survey of Methods, Challenges and Perspectives in Causality
Gael Gendron
Michael Witbrock
Gillian Dobbie
OOD
AI4CE
CML
12
12
0
01 Feb 2023
Introducing Expertise Logic into Graph Representation Learning from A Causal Perspective
Hang Gao
Jiangmeng Li
Wenwen Qiang
Lingyu Si
Xingzhe Su
Feng Wu
Changwen Zheng
Fuchun Sun
24
0
0
20 Jan 2023
Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training
Jing-ling Huang
Zhengxuan Wu
Kyle Mahowald
Christopher Potts
19
13
0
19 Dec 2022
Explainability Via Causal Self-Talk
Nicholas A. Roy
Junkyung Kim
Neil C. Rabinowitz
CML
6
7
0
17 Nov 2022
Neural Bayesian Network Understudy
Paloma Rabaey
Cedric De Boom
Thomas Demeester
BDL
CML
14
0
0
15 Nov 2022