ResearchTrend.AI
Mechanistic Interpretability for AI Safety -- A Review

22 April 2024 · Leonard Bereska, E. Gavves · AI4CE · arXiv:2404.14082

Papers citing "Mechanistic Interpretability for AI Safety -- A Review"

50 / 62 papers shown
Geospatial Mechanistic Interpretability of Large Language Models
Stef De Sabbata, Stefano Mizzaro, Kevin Roitero · AI4CE · 06 May 2025

Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde, Louis Jaburi · XAI · 02 May 2025

A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
Kola Ayonrinde, Louis Jaburi · MILM · 01 May 2025

Characterizing AI Agents for Alignment and Governance
Atoosa Kasirzadeh, Iason Gabriel · 30 Apr 2025

In defence of post-hoc explanations in medical AI
Joshua Hatherley, Lauritz Munch, Jens Christian Bjerring · 29 Apr 2025

Studying Small Language Models with Susceptibilities
Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet · AAML · 25 Apr 2025

Decoding Vision Transformers: the Diffusion Steering Lens
Ryota Takatsuki, Sonia Joseph, Ippei Fujisawa, Ryota Kanai · DiffM · 18 Apr 2025

I'm Sorry Dave: How the old world of personnel security can inform the new world of AI insider risk
Paul Martin, Sarah Mercer · 26 Mar 2025
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, M. Guan, Aleksander Mądry, Wojciech Zaremba, J. Pachocki, David Farhi · LRM · 14 Mar 2025

Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger, Boussad Addad, Katarzyna Kapusta · AAML · 08 Mar 2025

Triple Phase Transitions: Understanding the Learning Dynamics of Large Language Models from a Neuroscience Perspective
Yuko Nakagi, Keigo Tada, Sota Yoshino, Shinji Nishimoto, Yu Takagi · LRM · 28 Feb 2025

Beyond Release: Access Considerations for Generative AI Systems
Irene Solaiman, Rishi Bommasani, Dan Hendrycks, Ariel Herbert-Voss, Yacine Jernite, Aviya Skowron, Andrew Trask · 23 Feb 2025

Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
Shichang Zhang, Tessa Han, Usha Bhalla, Hima Lakkaraju · FAtt · 17 Feb 2025

Deciphering Functions of Neurons in Vision-Language Models
Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan Lu · VLM · 10 Feb 2025

Modular Training of Neural Networks aids Interpretability
Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots · 04 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa, Y. Matsuo · 09 Jan 2025

Improving Object Detection by Modifying Synthetic Data with Explainable AI
Nitish Mital, Simon Malzard, Richard Walters, Celso M. De Melo, Raghuveer Rao, Victoria Nockles · 02 Dec 2024

Evaluating Creative Short Story Generation in Humans and Large Language Models
Mete Ismayilzada, Claire Stevenson, Lonneke van der Plas · LM&MA, LRM · 04 Nov 2024

Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
Samuele Poppi, Zheng-Xin Yong, Yifei He, Bobbie Chern, Han Zhao, Aobo Yang, Jianfeng Chi · AAML · 23 Oct 2024

Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Xuanli He, Kam-Fai Wong, Pasquale Minervini · KELM, LLMSV · 21 Oct 2024
On the Role of Attention Heads in Large Language Model Safety
Z. Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li · 17 Oct 2024

The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark · 10 Oct 2024

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao · 02 Jul 2024

Generative AI Systems: A Systems-based Perspective on Generative AI
Jakub M. Tomczak · 25 Jun 2024

Review and Prospect of Algebraic Research in Equivalent Framework between Statistical Mechanics and Machine Learning Theory
Sumio Watanabe · 31 May 2024

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart J. Russell, Max Tegmark, ..., Clark Barrett, Ding Zhao, Zhi-Xuan Tan, Jeannette Wing, Joshua Tenenbaum · 10 May 2024
Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
Narek Maloyan, Ekansh Verma, Bulat Nutfullin, Bislan Ashinov · 21 Apr 2024

AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda · KELM · 01 Mar 2024

On the Challenges and Opportunities in Generative AI
Laura Manduchi, Kushagra Pandey, Robert Bamler, Ryan Cotterell, Sina Daubener, ..., F. Wenzel, Frank Wood, Stephan Mandt, Vincent Fortuin · 28 Feb 2024

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing-ling Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger · 27 Feb 2024

Information Flow Routes: Automatically Interpreting Language Models at Scale
Javier Ferrando, Elena Voita · 27 Feb 2024

Robust agents learn causal world models
Jonathan G. Richens, Tom Everitt · OOD · 16 Feb 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Philip Quirke, Clement Neo, Fazl Barez · KELM, LRM · 04 Feb 2024

Universal Neurons in GPT2 Language Models
Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas · MILM · 22 Jan 2024

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea · 03 Jan 2024

Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien, Eric Winsor · LRM, ReLM · 13 Dec 2023

Structured World Representations in Maze-Solving Transformers
Michael I. Ivanitskiy, Alex F Spies, Tilman Rauker, Guillaume Corlouer, Chris Mathwin, ..., Rusheb Shah, Dan Valentine, Cecilia G. Diniz Behn, Katsumi Inoue, Samy Wu Fung · 05 Dec 2023
Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism
Mansi Sakarvadia, Arham Khan, Aswathy Ajith, Daniel Grzenda, Nathaniel Hudson, André Bauer, Kyle Chard, Ian T. Foster · 25 Oct 2023

Characterizing Mechanisms for Factual Recall in Language Models
Qinan Yu, Jack Merullo, Ellie Pavlick · KELM · 24 Oct 2023

Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, D. Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez · 20 Oct 2023

Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed, Can Rager, Arthur Conmy · 16 Oct 2023

Growing Brains: Co-emergence of Anatomical and Functional Modularity in Recurrent Neural Networks
Ziming Liu, Mikail Khona, Ila R. Fiete, Max Tegmark · 11 Oct 2023

An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l
James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak · 11 Oct 2023
Can LLMs facilitate interpretation of pre-trained language models?
Basel Mousi, Nadir Durrani, Fahim Dalvi · 22 May 2023

Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas · MILM · 02 May 2023

How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna, Ollie Liu, Alexandre Variengien · LRM · 30 Apr 2023

Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva, Jasmijn Bastings, Katja Filippova, Amir Globerson · KELM · 28 Apr 2023

Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, J. Gehrke, Eric Horvitz, ..., Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang · ELM, AI4MH, AI4CE, ALM · 22 Mar 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman · CML · 05 Mar 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt · 01 Nov 2022