ResearchTrend.AI
Mechanistic Interpretability for AI Safety -- A Review

22 April 2024 · Leonard Bereska, E. Gavves · AI4CE · arXiv:2404.14082

Papers citing "Mechanistic Interpretability for AI Safety -- A Review"

50 / 62 papers shown
Geospatial Mechanistic Interpretability of Large Language Models
Stef De Sabbata, Stefano Mizzaro, Kevin Roitero · AI4CE · 06 May 2025

Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde, Louis Jaburi · XAI · 02 May 2025

A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
Kola Ayonrinde, Louis Jaburi · MILM · 01 May 2025

Characterizing AI Agents for Alignment and Governance
Atoosa Kasirzadeh, Iason Gabriel · 30 Apr 2025

In defence of post-hoc explanations in medical AI
Joshua Hatherley, Lauritz Munch, Jens Christian Bjerring · 29 Apr 2025

Studying Small Language Models with Susceptibilities
Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet · AAML · 25 Apr 2025

Decoding Vision Transformers: the Diffusion Steering Lens
Ryota Takatsuki, Sonia Joseph, Ippei Fujisawa, Ryota Kanai · DiffM · 18 Apr 2025

I'm Sorry Dave: How the old world of personnel security can inform the new world of AI insider risk
Paul Martin, Sarah Mercer · 26 Mar 2025
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, M. Guan, Aleksander Mądry, Wojciech Zaremba, J. Pachocki, David Farhi · LRM · 14 Mar 2025

Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger, Boussad Addad, Katarzyna Kapusta · AAML · 08 Mar 2025

Triple Phase Transitions: Understanding the Learning Dynamics of Large Language Models from a Neuroscience Perspective
Yuko Nakagi, Keigo Tada, Sota Yoshino, Shinji Nishimoto, Yu Takagi · LRM · 28 Feb 2025

Beyond Release: Access Considerations for Generative AI Systems
Irene Solaiman, Rishi Bommasani, Dan Hendrycks, Ariel Herbert-Voss, Yacine Jernite, Aviya Skowron, Andrew Trask · 23 Feb 2025

Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
Shichang Zhang, Tessa Han, Usha Bhalla, Hima Lakkaraju · FAtt · 17 Feb 2025

Deciphering Functions of Neurons in Vision-Language Models
Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan Lu · VLM · 10 Feb 2025

Modular Training of Neural Networks aids Interpretability
Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots · 04 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa, Y. Matsuo · 09 Jan 2025

Improving Object Detection by Modifying Synthetic Data with Explainable AI
Nitish Mital, Simon Malzard, Richard Walters, Celso M. De Melo, Raghuveer Rao, Victoria Nockles · 02 Dec 2024

Evaluating Creative Short Story Generation in Humans and Large Language Models
Mete Ismayilzada, Claire Stevenson, Lonneke van der Plas · LM&MA, LRM · 04 Nov 2024

Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
Samuele Poppi, Zheng-Xin Yong, Yifei He, Bobbie Chern, Han Zhao, Aobo Yang, Jianfeng Chi · AAML · 23 Oct 2024

Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Xuanli He, Kam-Fai Wong, Pasquale Minervini · KELM, LLMSV · 21 Oct 2024
On the Role of Attention Heads in Large Language Model Safety
Z. Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li · 17 Oct 2024

The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark · 10 Oct 2024

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao · 02 Jul 2024

Generative AI Systems: A Systems-based Perspective on Generative AI
Jakub M. Tomczak · 25 Jun 2024

Review and Prospect of Algebraic Research in Equivalent Framework between Statistical Mechanics and Machine Learning Theory
Sumio Watanabe · 31 May 2024

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart J. Russell, Max Tegmark, ..., Clark Barrett, Ding Zhao, Zhi-Xuan Tan, Jeannette Wing, Joshua Tenenbaum · 10 May 2024
Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
Narek Maloyan, Ekansh Verma, Bulat Nutfullin, Bislan Ashinov · 21 Apr 2024

AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda · KELM · 01 Mar 2024

On the Challenges and Opportunities in Generative AI
Laura Manduchi, Kushagra Pandey, Robert Bamler, Ryan Cotterell, Sina Daubener, ..., F. Wenzel, Frank Wood, Stephan Mandt, Vincent Fortuin · 28 Feb 2024

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing-ling Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger · 27 Feb 2024

Information Flow Routes: Automatically Interpreting Language Models at Scale
Javier Ferrando, Elena Voita · 27 Feb 2024

Robust agents learn causal world models
Jonathan G. Richens, Tom Everitt · OOD · 16 Feb 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Philip Quirke, Clement Neo, Fazl Barez · KELM, LRM · 04 Feb 2024

Universal Neurons in GPT2 Language Models
Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas · MILM · 22 Jan 2024

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea · 03 Jan 2024

Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien, Eric Winsor · LRM, ReLM · 13 Dec 2023

Structured World Representations in Maze-Solving Transformers
Michael I. Ivanitskiy, Alex F Spies, Tilman Rauker, Guillaume Corlouer, Chris Mathwin, ..., Rusheb Shah, Dan Valentine, Cecilia G. Diniz Behn, Katsumi Inoue, Samy Wu Fung · 05 Dec 2023
Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism
Mansi Sakarvadia, Arham Khan, Aswathy Ajith, Daniel Grzenda, Nathaniel Hudson, André Bauer, Kyle Chard, Ian T. Foster · 25 Oct 2023

Characterizing Mechanisms for Factual Recall in Language Models
Qinan Yu, Jack Merullo, Ellie Pavlick · KELM · 24 Oct 2023

Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, D. Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez · 20 Oct 2023

Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed, Can Rager, Arthur Conmy · 16 Oct 2023

Growing Brains: Co-emergence of Anatomical and Functional Modularity in Recurrent Neural Networks
Ziming Liu, Mikail Khona, Ila R. Fiete, Max Tegmark · 11 Oct 2023

An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l
James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak · 11 Oct 2023
Can LLMs facilitate interpretation of pre-trained language models?
Basel Mousi, Nadir Durrani, Fahim Dalvi · 22 May 2023

Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas · MILM · 02 May 2023

How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna, Ollie Liu, Alexandre Variengien · LRM · 30 Apr 2023

Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva, Jasmijn Bastings, Katja Filippova, Amir Globerson · KELM · 28 Apr 2023

Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, J. Gehrke, Eric Horvitz, ..., Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang · ELM, AI4MH, AI4CE, ALM · 22 Mar 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman · CML · 05 Mar 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt · 01 Nov 2022