2309.05973
Cited By
v1
v2 (latest)
Circuit Breaking: Removing Model Behaviors with Targeted Ablation
12 September 2023
Maximilian Li
Xander Davies
Max Nadeau
KELM
MU
Papers citing
"Circuit Breaking: Removing Model Behaviors with Targeted Ablation"
13 / 13 papers shown
Title
Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation
Yasser Hamidullah
Koel Dutta Chowdhury
Yusser Al Ghussin
Shakib Yazdani
Cennet Oguz
Josef van Genabith
C. España-Bonet
133
0
0
21 Oct 2025
Interpretability as Alignment: Making Internal Understanding a Design Principle
Aadit Sengupta
Pratinav Seth
Vinay Kumar Sankarapu
AI4CE
AAML
121
0
0
10 Sep 2025
IF-GUIDE: Influence Function-Guided Detoxification of LLMs
Zachary Coalson
Juhan Bae
Nicholas Carlini
Sanghyun Hong
TDI
349
1
0
02 Jun 2025
Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries
Tianyi Lorena Yan
Robin Jia
KELM
MU
276
0
0
27 Feb 2025
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks
Aaron Mueller
CML
192
16
0
05 Jul 2024
Sheaf Discovery with Joint Computation Graph Pruning and Flexible Granularity
Lei Yu
Jingcheng Niu
Zining Zhu
Xi Chen
Gerald Penn
175
9
0
04 Jul 2024
Knowledge Circuits in Pretrained Transformers
Yunzhi Yao
Ningyu Zhang
Zekun Xi
Meng Wang
Ziwen Xu
Shumin Deng
Huajun Chen
KELM
349
41
0
28 May 2024
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov
Georg Lange
Neel Nanda
315
58
0
14 May 2024
Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah
Andrew Ilyas
Aleksander Madry
KELM
258
23
0
17 Apr 2024
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Zhengxuan Wu
Atticus Geiger
Aryaman Arora
Jing-ling Huang
Zheng Wang
Noah D. Goodman
Christopher D. Manning
Christopher Potts
MU
209
43
0
12 Mar 2024
SoK: Memorization in General-Purpose Large Language Models
Valentin Hartmann
Anshuman Suri
Vincent Bindschaedler
David Evans
Shruti Tople
Robert West
KELM
LLMAG
288
35
0
24 Oct 2023
NeuroSurgeon: A Toolkit for Subnetwork Analysis
Michael A. Lepori
Ellie Pavlick
Thomas Serre
168
8
0
01 Sep 2023
A Unified Approach to Interpreting Model Predictions
Scott M. Lundberg
Su-In Lee
FAtt
2.7K
28,515
0
22 May 2017