Circuit Breaking: Removing Model Behaviors with Targeted Ablation

12 September 2023

Papers citing "Circuit Breaking: Removing Model Behaviors with Targeted Ablation"

Title | Authors | Tags | Counts | Date
Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation | Yasser Hamidullah, Koel Dutta Chowdury, Yusser Al Ghussin, Shakib Yazdani, Cennet Oguz, Josef van Genabith, C. España-Bonet | | 133 0 0 | 21 Oct 2025
Interpretability as Alignment: Making Internal Understanding a Design Principle | Aadit Sengupta, Pratinav Seth, Vinay Kumar Sankarapu | AI4CE, AAML | 121 0 0 | 10 Sep 2025
IF-GUIDE: Influence Function-Guided Detoxification of LLMs | Zachary Coalson, Juhan Bae, Nicholas Carlini, Sanghyun Hong | TDI | 349 1 0 | 02 Jun 2025
Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries | Tianyi Lorena Yan, Robin Jia | KELM, MU | 276 0 0 | 27 Feb 2025
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks | Aaron Mueller | CML | 192 16 0 | 05 Jul 2024
Sheaf Discovery with Joint Computation Graph Pruning and Flexible Granularity | Lei Yu, Jingcheng Niu, Zining Zhu, Xi Chen, Gerald Penn | | 175 9 0 | 04 Jul 2024
Knowledge Circuits in Pretrained Transformers | Yunzhi Yao, Ningyu Zhang, Zekun Xi, Meng Wang, Ziwen Xu, Shumin Deng, Huajun Chen | KELM | 349 41 0 | 28 May 2024
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control | Aleksandar Makelov, Georg Lange, Neel Nanda | | 315 58 0 | 14 May 2024
Decomposing and Editing Predictions by Modeling Model Computation | Harshay Shah, Andrew Ilyas, Aleksander Madry | KELM | 258 23 0 | 17 Apr 2024
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions (North American Chapter of the Association for Computational Linguistics (NAACL), 2024) | Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing-ling Huang, Zheng Wang, Noah D. Goodman, Christopher D. Manning, Christopher Potts | MU | 209 43 0 | 12 Mar 2024
SoK: Memorization in General-Purpose Large Language Models | Valentin Hartmann, Anshuman Suri, Vincent Bindschaedler, David Evans, Shruti Tople, Robert West | KELM, LLMAG | 288 35 0 | 24 Oct 2023
NeuroSurgeon: A Toolkit for Subnetwork Analysis | Michael A. Lepori, Ellie Pavlick, Thomas Serre | | 168 8 0 | 01 Sep 2023
A Unified Approach to Interpreting Model Predictions | Scott M. Lundberg, Su-In Lee | FAtt | 2.7K 28,515 0 | 22 May 2017