Low-Complexity Probing via Finding Subnetworks

North American Chapter of the Association for Computational Linguistics (NAACL), 2021

8 April 2021

Papers citing "Low-Complexity Probing via Finding Subnetworks"

Weight-sparse transformers have interpretable circuits

237

17 Nov 2025

PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization

200

27 Oct 2025

C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression

Baptiste Bauvin

Loïc Baret

Ola Ahmad

132

21 Oct 2025

Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework

28 Sep 2025

Towards Transparent AI: A Survey on Explainable Language Models

Avash Palikhe

Sribala Vidyadhari Chinta

185

25 Sep 2025

From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits

Jiaqi W. Ma

Shichang Zhang

119

22 Aug 2025

On the Performance of Concept Probing: The Influence of the Data (Extended Version)

Manuel de Sousa Ribeiro

Afonso Leote

João Leite

197

24 Jul 2025

Concept Probing: Where to Find Human-Defined Concepts (Extended Version)

Manuel de Sousa Ribeiro

Afonso Leote

João Leite

189

24 Jul 2025

What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models

639

09 Jul 2025

Stochastic Parameter Decomposition

Lucius Bushnaq

Dan Braun

Lee D. Sharkey

221

25 Jun 2025

Line of Sight: On Linear Representations in VLLMs

315

05 Jun 2025

Analyzing the Inner Workings of Transformers in Compositional Generalization

North American Chapter of the Association for Computational Linguistics (NAACL), 2025

Ryoma Kumon

Hitomi Yanaka

327

24 Feb 2025

Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Philipp Mondorf

Sondre Wold

Yun Xue

501

02 Oct 2024

Optimal ablation for interpretability

Neural Information Processing Systems (NeurIPS), 2024

Maximilian Li

Lucas Janson

FAtt

343

16 Sep 2024

Explaining Human Comparisons using Alignment-Importance Heatmaps

Nhut Truong

Dario Pesenti

Uri Hasson

183

08 Sep 2024

The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis

Computational Linguistics (CL), 2024

...

511

02 Aug 2024

Tracking linguistic information in transformer-based sentence embeddings through targeted sparsification

Vivi Nastase

Paola Merlo

199

25 Jul 2024

Investigating the Indirect Object Identification circuit in Mamba

Danielle Ensign

Adrià Garriga-Alonso

Mamba

170

19 Jul 2024

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

351

19 Jul 2024

Sheaf Discovery with Joint Computation Graph Pruning and Flexible Granularity

221

04 Jul 2024

Are there identifiable structural parts in the sentence embedding whole?

Vivi Nastase

Paola Merlo

200

24 Jun 2024

Finding Transformer Circuits with Edge Pruning

Adithya Bhaskar

Alexander Wettig

Dan Friedman

Danqi Chen

471

24 Jun 2024

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Charles O'Neill

Thang Bui

209

21 May 2024

Automatic Discovery of Visual Circuits

186

22 Apr 2024

Mechanistic Interpretability for AI Safety -- A Review

Leonard Bereska

E. Gavves

AI4CE

383

301

22 Apr 2024

Decomposing and Editing Predictions by Modeling Model Computation

Harshay Shah

Andrew Ilyas

Aleksander Madry

KELM

296

17 Apr 2024

Embedded Named Entity Recognition using Probing Classifiers

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Nicholas Popovic

Michael Färber

234

18 Mar 2024

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Kaixuan Huang

Mengdi Wang

331

174

07 Feb 2024

Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks

North American Chapter of the Association for Computational Linguistics (NAACL), 2023

Ting-Yun Chang

Jesse Thomason

Robin Jia

323

15 Nov 2023

Uncovering Intermediate Variables in Transformers using Circuit Probing

Michael A. Lepori

Thomas Serre

Ellie Pavlick

399

07 Nov 2023

Attribution Patching Outperforms Automated Circuit Discovery

BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023

Aaquib Syed

Can Rager

Arthur Conmy

369

102

16 Oct 2023

SPADE: Sparsity-Guided Debugging for Deep Neural Networks

International Conference on Machine Learning (ICML), 2023

Arshia Soltani Moakhar

Eugenia Iofinova

Elias Frantar

Dan Alistarh

332

06 Oct 2023

Discovering Knowledge-Critical Subnetworks in Pretrained Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

261

04 Oct 2023

Circuit Breaking: Removing Model Behaviors with Targeted Ablation

306

12 Sep 2023

NeuroSurgeon: A Toolkit for Subnetwork Analysis

Michael A. Lepori

Ellie Pavlick

Thomas Serre

202

01 Sep 2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

323

141

18 Jul 2023

Finding Neurons in a Haystack: Case Studies with Sparse Probing

540

291

02 May 2023

Towards Automated Circuit Discovery for Mechanistic Interpretability

Neural Information Processing Systems (NeurIPS), 2023

Arthur Conmy

Augustine N. Mavor-Parker

Aengus Lynch

Stefan Heimersheim

Adrià Garriga-Alonso

540

452

28 Apr 2023

Break It Down: Evidence for Structural Compositionality in Neural Networks

Neural Information Processing Systems (NeurIPS), 2023

Michael A. Lepori

Thomas Serre

Ellie Pavlick

335

26 Jan 2023

CREPE: Can Vision-Language Foundation Models Reason Compositionally?

Computer Vision and Pattern Recognition (CVPR), 2022

376

183

13 Dec 2022

The Architectural Bottleneck Principle

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

190

11 Nov 2022

SocioProbe: What, When, and Where Language Models Learn about Sociodemographics

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

221

08 Nov 2022

Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task

International Conference on Learning Representations (ICLR), 2022

601

386

24 Oct 2022

The Open-World Lottery Ticket Hypothesis for OOD Intent Classification

International Conference on Language Resources and Evaluation (LREC), 2022

Xipeng Qiu

333

13 Oct 2022

Probing via Prompting

North American Chapter of the Association for Computational Linguistics (NAACL), 2022

Jiaoda Li

Robert Bamler

Mrinmaya Sachan

261

04 Jul 2022

Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model

225

24 May 2022

Visualizing the Relationship Between Encoded Linguistic Information and Task Performance

Findings (Findings), 2022

Defu Lian

Taro Watanabe

142

29 Mar 2022

The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail

Sam Bowman

OffRL

373

15 Oct 2021

Conditional probing: measuring usable information beyond a baseline

John Hewitt

Kawin Ethayarajh

Abigail Z. Jacobs

Christopher D. Manning

208

19 Sep 2021

How Does Adversarial Fine-Tuning Benefit BERT?

J. Ebrahimi

Hao Yang

Wei Zhang

AAML

254

31 Aug 2021