ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2309.08600
  4. Cited By
Sparse Autoencoders Find Highly Interpretable Features in Language
  Models

Sparse Autoencoders Find Highly Interpretable Features in Language Models

15 September 2023
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
    MILM
ArXivPDFHTML

Papers citing "Sparse Autoencoders Find Highly Interpretable Features in Language Models"

49 / 49 papers shown
Title
Demystifying optimized prompts in language models
Demystifying optimized prompts in language models
Rimon Melamed
Lucas H. McCabe
H. H. Huang
31
0
0
04 May 2025
Interpreting Multilingual and Document-Length Sensitive Relevance Computations in Neural Retrieval Models through Axiomatic Causal Interventions
Interpreting Multilingual and Document-Length Sensitive Relevance Computations in Neural Retrieval Models through Axiomatic Causal Interventions
Oliver Savolainen
Dur e Najaf Amjad
Roxana Petcu
AAML
14
0
0
04 May 2025
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde
Louis Jaburi
XAI
52
1
0
02 May 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He
J. Wang
Rui Lin
Xuyang Ge
Wentao Shu
Qiong Tang
J. Zhang
Xipeng Qiu
68
0
0
29 Apr 2025
Improving Reasoning Performance in Large Language Models via Representation Engineering
Improving Reasoning Performance in Large Language Models via Representation Engineering
Bertram Højer
Oliver Jarvis
Stefan Heinrich
LRM
67
1
0
28 Apr 2025
Decoding Vision Transformers: the Diffusion Steering Lens
Decoding Vision Transformers: the Diffusion Steering Lens
Ryota Takatsuki
Sonia Joseph
Ippei Fujisawa
Ryota Kanai
DiffM
23
0
0
18 Apr 2025
Towards Combinatorial Interpretability of Neural Computation
Towards Combinatorial Interpretability of Neural Computation
Micah Adler
Dan Alistarh
Nir Shavit
FAtt
78
1
0
10 Apr 2025
Effective Skill Unlearning through Intervention and Abstention
Effective Skill Unlearning through Intervention and Abstention
Yongce Li
Chung-En Sun
Tsui-Wei Weng
MU
43
0
0
27 Mar 2025
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
Adam Karvonen
26
0
0
21 Mar 2025
Representational Similarity via Interpretable Visual Concepts
Representational Similarity via Interpretable Visual Concepts
Neehar Kondapaneni
Oisin Mac Aodha
Pietro Perona
DRL
57
0
0
19 Mar 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun
Jing Huang
Sidharth Baskaran
Karel DÓosterlinck
Christopher Potts
Michael Sklar
Atticus Geiger
AI4CE
55
0
0
13 Mar 2025
Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models
Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models
Zhihua Tian
Sirun Nan
Ming Xu
Shengfang Zhai
Wenjie Qu
Jian Liu
Kui Ren
Ruoxi Jia
Jiaheng Zhang
DiffM
65
1
0
12 Mar 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger
Boussad Addad
Katarzyna Kapusta
AAML
59
0
0
08 Mar 2025
Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking
Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking
Yifan Zhang
Wenyu Du
Dongming Jin
Jie Fu
Zhi Jin
LRM
46
0
0
27 Feb 2025
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
Anton Razzhigaev
Matvey Mikhalchuk
Temurbek Rahmatullaev
Elizaveta Goncharova
Polina Druzhinina
Ivan V. Oseledets
Andrey Kuznetsov
48
1
0
20 Feb 2025
The Knowledge Microscope: Features as Better Analytical Lenses than Neurons
The Knowledge Microscope: Features as Better Analytical Lenses than Neurons
Yuheng Chen
Pengfei Cao
Kang Liu
Jun Zhao
40
0
0
18 Feb 2025
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
Shichang Zhang
Tessa Han
Usha Bhalla
Hima Lakkaraju
FAtt
143
0
0
17 Feb 2025
The Complexity of Learning Sparse Superposed Features with Feedback
The Complexity of Learning Sparse Superposed Features with Feedback
Akash Kumar
50
0
0
08 Feb 2025
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment
Harrish Thasarathan
Julian Forsyth
Thomas Fel
M. Kowal
Konstantinos G. Derpanis
81
7
0
06 Feb 2025
Modular Training of Neural Networks aids Interpretability
Modular Training of Neural Networks aids Interpretability
Satvik Golechha
Maheep Chaudhary
Joan Velja
Alessandro Abate
Nandi Schoots
71
0
0
04 Feb 2025
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
Bartosz Cywiñski
Kamil Deja
DiffM
52
6
0
29 Jan 2025
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
Jannik Brinkmann
Chris Wendler
Christian Bartelt
Aaron Mueller
38
9
0
10 Jan 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Minegishi
Hiroki Furuta
Yusuke Iwasawa
Y. Matsuo
40
1
0
09 Jan 2025
Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers
Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers
Markus J. Buehler
AI4CE
33
1
0
04 Jan 2025
Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
Yang Xu
Y. Wang
Hao Wang
46
1
0
23 Dec 2024
Transformers Use Causal World Models in Maze-Solving Tasks
Transformers Use Causal World Models in Maze-Solving Tasks
Alex F Spies
William Edwards
Michael I. Ivanitskiy
Adrians Skapars
Tilman Rauker
Katsumi Inoue
A. Russo
Murray Shanahan
84
1
0
16 Dec 2024
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders
Charles OÑeill
David Klindt
David Klindt
64
1
0
20 Nov 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Zhibo Wang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
36
3
0
17 Nov 2024
A Mechanistic Explanatory Strategy for XAI
A Mechanistic Explanatory Strategy for XAI
Marcin Rabiza
30
1
0
02 Nov 2024
Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning
Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning
John Wu
David Wu
Jimeng Sun
20
1
0
31 Oct 2024
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Yu Zhao
Alessio Devoto
Giwon Hong
Xiaotang Du
Aryo Pradipta Gema
Hongru Wang
Xuanli He
Kam-Fai Wong
Pasquale Minervini
KELM
LLMSV
27
14
0
21 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders
Decomposing The Dark Matter of Sparse Autoencoders
Joshua Engels
Logan Riggs
Max Tegmark
LLMSV
37
9
0
18 Oct 2024
Improving Instruction-Following in Language Models through Activation Steering
Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo
Vidhisha Balachandran
Safoora Yousefi
Eric Horvitz
Besmira Nushi
LLMSV
40
13
0
15 Oct 2024
Analyzing (In)Abilities of SAEs via Formal Languages
Analyzing (In)Abilities of SAEs via Formal Languages
Abhinav Menon
Manish Shrivastava
David M. Krueger
Ekdeep Singh Lubana
32
7
0
15 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li
Eric J. Michaud
David D. Baek
Joshua Engels
Xiaoqing Sun
Max Tegmark
28
7
0
10 Oct 2024
Towards Interpreting Visual Information Processing in Vision-Language Models
Towards Interpreting Visual Information Processing in Vision-Language Models
Clement Neo
Luke Ong
Philip H. S. Torr
Mor Geva
David M. Krueger
Fazl Barez
78
6
0
09 Oct 2024
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning
Wei Chen
Zhen Huang
Liang Xie
Binbin Lin
Houqiang Li
...
Deng Cai
Yonggang Zhang
Wenxiao Wang
Xu Shen
Jieping Ye
25
6
0
03 Sep 2024
Knowledge in Superposition: Unveiling the Failures of Lifelong Knowledge Editing for Large Language Models
Knowledge in Superposition: Unveiling the Failures of Lifelong Knowledge Editing for Large Language Models
Chenhui Hu
Pengfei Cao
Yubo Chen
Kang Liu
Jun Zhao
KELM
41
2
0
14 Aug 2024
Representing Rule-based Chatbots with Transformers
Representing Rule-based Chatbots with Transformers
Dan Friedman
Abhishek Panigrahi
Danqi Chen
44
1
0
15 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
49
18
0
02 Jul 2024
Finding Transformer Circuits with Edge Pruning
Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar
Alexander Wettig
Dan Friedman
Danqi Chen
44
14
0
24 Jun 2024
Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs
Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs
D. Yaldiz
Yavuz Faruk Bakman
Baturalp Buyukates
Chenyang Tao
Anil Ramakrishna
Dimitrios Dimitriadis
Jieyu Zhao
Salman Avestimehr
30
1
0
17 Jun 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Jack Merullo
Carsten Eickhoff
Ellie Pavlick
38
2
0
13 Jun 2024
Knowledge Circuits in Pretrained Transformers
Knowledge Circuits in Pretrained Transformers
Yunzhi Yao
Ningyu Zhang
Zekun Xi
Meng Wang
Ziwen Xu
Shumin Deng
Huajun Chen
KELM
45
19
0
28 May 2024
KAN: Kolmogorov-Arnold Networks
KAN: Kolmogorov-Arnold Networks
Ziming Liu
Yixuan Wang
Sachin Vaidya
Fabian Ruehle
James Halverson
Marin Soljacic
Thomas Y. Hou
Max Tegmark
32
444
0
30 Apr 2024
Interpretability in the Wild: a Circuit for Indirect Object
  Identification in GPT-2 small
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
205
486
0
01 Nov 2022
Toy Models of Superposition
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
117
314
0
21 Sep 2022
Linear Adversarial Concept Erasure
Linear Adversarial Concept Erasure
Shauli Ravfogel
Michael Twiton
Yoav Goldberg
Ryan Cotterell
KELM
62
56
0
28 Jan 2022
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
236
1,508
0
31 Dec 2020
1