ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2408.05147
  4. Cited By
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

9 August 2024
Tom Lieberum
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Nicolas Sonnerat
Vikrant Varma
János Kramár
Anca Dragan
Rohin Shah
Neel Nanda
ArXivPDFHTML

Papers citing "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2"

50 / 66 papers shown
Title
Are Sparse Autoencoders Useful for Java Function Bug Detection?
Are Sparse Autoencoders Useful for Java Function Bug Detection?
Rui Melo
Claudia Mamede
Andre Catarino
Rui Abreu
Henrique Lopes Cardoso
14
0
0
15 May 2025
An Introduction to Discrete Variational Autoencoders
An Introduction to Discrete Variational Autoencoders
Alan Jeffares
Liyuan Liu
DRL
BDL
CML
27
0
0
15 May 2025
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
Dong Shu
Xuansheng Wu
Haiyan Zhao
Mengnan Du
Ninghao Liu
LLMSV
37
0
0
12 May 2025
Patterns and Mechanisms of Contrastive Activation Engineering
Patterns and Mechanisms of Contrastive Activation Engineering
Yixiong Hao
Ayush Panda
Stepan Shabalin
Sheikh Abdur Raheem Ali
LLMSV
60
0
0
06 May 2025
Empirical Evaluation of Progressive Coding for Sparse Autoencoders
Empirical Evaluation of Progressive Coding for Sparse Autoencoders
Hans Peter
Anders Søgaard
36
0
0
30 Apr 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He
J. Wang
Rui Lin
Xuyang Ge
Wentao Shu
Qiong Tang
J. Zhang
Xipeng Qiu
70
0
0
29 Apr 2025
Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption
Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption
Wenxiao Wang
Parsa Hosseini
S. Feizi
LRM
AI4CE
53
0
0
29 Apr 2025
Representation Learning on a Random Lattice
Representation Learning on a Random Lattice
Aryeh Brill
OOD
FAtt
AI4CE
73
0
0
28 Apr 2025
Scaling sparse feature circuit finding for in-context learning
Scaling sparse feature circuit finding for in-context learning
Dmitrii Kharlapenko
S. Kamath S
Fazl Barez
Arthur Conmy
Neel Nanda
26
0
0
18 Apr 2025
MIB: A Mechanistic Interpretability Benchmark
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller
Atticus Geiger
Sarah Wiegreffe
Dana Arad
Iván Arcuschin
...
Alessandro Stolfo
Martin Tutek
Amir Zur
David Bau
Yonatan Belinkov
41
1
0
17 Apr 2025
SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
SAEs Can\textit{Can}Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
Aashiq Muhamed
Jacopo Bonato
Mona Diab
Virginia Smith
MU
58
0
0
11 Apr 2025
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric
Yixin Cao
Jiahao Ying
Y. Wang
Xipeng Qiu
Xuanjing Huang
Yugang Jiang
ELM
30
2
0
10 Apr 2025
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
Simon Lermen
Mateusz Dziemian
Natalia Pérez-Campanero Antolín
31
0
0
10 Apr 2025
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
Sewoong Lee
Adam Davies
Marc E. Canby
J. Hockenmaier
LLMSV
65
0
0
31 Mar 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
Bart Bussmann
Noa Nabeshima
Adam Karvonen
Neel Nanda
54
0
0
21 Mar 2025
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
Adam Karvonen
34
0
0
21 Mar 2025
Towards LLM Guardrails via Sparse Representation Steering
Towards LLM Guardrails via Sparse Representation Steering
Zeqing He
Zhibo Wang
Huiyu Xu
Kui Ren
LLMSV
52
1
0
21 Mar 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Adam Karvonen
Can Rager
Johnny Lin
Curt Tigges
Joseph Isaac Bloom
...
Kola Ayonrinde
Matthew Wearden
Arthur Conmy
Samuel Marks
Neel Nanda
MU
55
8
0
12 Mar 2025
Mixture of Experts Made Intrinsically Interpretable
Xingyi Yang
Constantin Venhoff
Ashkan Khakzar
Christian Schroeder de Witt
P. Dokania
Adel Bibi
Philip H. S. Torr
MoE
49
0
0
05 Mar 2025
Towards Understanding Distilled Reasoning Models: A Representational Approach
Towards Understanding Distilled Reasoning Models: A Representational Approach
David D. Baek
Max Tegmark
LRM
75
2
0
05 Mar 2025
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Kristian Kuznetsov
Laida Kushnareva
Polina Druzhinina
Anton Razzhigaev
Anastasia Voznyuk
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
34
0
0
05 Mar 2025
SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs
Samir Abdaljalil
Filippo Pallucchini
Andrea Seveso
Hasan Kurban
Fabio Mercorio
Erchin Serpedin
HILM
72
0
0
04 Mar 2025
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
Sai Sumedh R. Hindupur
Ekdeep Singh Lubana
Thomas Fel
Demba Ba
37
4
0
03 Mar 2025
Interpreting CLIP with Hierarchical Sparse Autoencoders
Interpreting CLIP with Hierarchical Sparse Autoencoders
Vladimir Zaigrajew
Hubert Baniecki
P. Biecek
49
0
0
27 Feb 2025
Do Sparse Autoencoders Generalize? A Case Study of Answerability
Do Sparse Autoencoders Generalize? A Case Study of Answerability
Lovis Heindrich
Philip H. S. Torr
Fazl Barez
Veronika Thost
69
1
0
27 Feb 2025
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
Xuansheng Wu
Jiayi Yuan
Wenlin Yao
Xiaoming Zhai
Ninghao Liu
LLMSV
78
4
0
24 Feb 2025
FADE: Why Bad Descriptions Happen to Good Features
FADE: Why Bad Descriptions Happen to Good Features
Bruno Puri
Aakriti Jain
Elena Golimblevskaia
Patrick Kahardipraja
Thomas Wiegand
Wojciech Samek
Sebastian Lapuschkin
100
0
0
24 Feb 2025
The Knowledge Microscope: Features as Better Analytical Lenses than Neurons
The Knowledge Microscope: Features as Better Analytical Lenses than Neurons
Yuheng Chen
Pengfei Cao
Kang Liu
Jun Zhao
43
0
0
18 Feb 2025
Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards
Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards
Xinyi Yang
Liang Zeng
Heng Dong
C. Yu
X. Wu
H. Yang
Yu Wang
Milind Tambe
Tonghan Wang
68
2
0
18 Feb 2025
The Multi-Faceted Monosemanticity in Multimodal Representations
The Multi-Faceted Monosemanticity in Multimodal Representations
Hanqi Yan
Xiangxiang Cui
Lu Yin
Paul Pu Liang
Yulan He
Yifei Wang
44
0
0
16 Feb 2025
SEER: Self-Explainability Enhancement of Large Language Models' Representations
SEER: Self-Explainability Enhancement of Large Language Models' Representations
Guanxu Chen
Dongrui Liu
Tao Luo
Jing Shao
LRM
MILM
65
1
0
07 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Minegishi
Hiroki Furuta
Yusuke Iwasawa
Y. Matsuo
49
1
0
09 Jan 2025
Can Input Attributions Interpret the Inductive Reasoning Process Elicited in In-Context Learning?
Can Input Attributions Interpret the Inductive Reasoning Process Elicited in In-Context Learning?
Mengyu Ye
Tatsuki Kuribayashi
Goro Kobayashi
Jun Suzuki
LRM
87
0
0
20 Dec 2024
VISTA: A Panoramic View of Neural Representations
VISTA: A Panoramic View of Neural Representations
Tom White
62
0
0
03 Dec 2024
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Arithmetic Reasoning
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Arithmetic Reasoning
Keito Kudo
Yoichi Aoki
Tatsuki Kuribayashi
Shusaku Sone
Masaya Taniguchi
Ana Brassard
Keisuke Sakaguchi
Kentaro Inui
ReLM
LRM
69
0
0
02 Dec 2024
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando
Oscar Obeso
Senthooran Rajamanoharan
Neel Nanda
75
10
0
21 Nov 2024
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders
Charles OÑeill
David Klindt
David Klindt
84
1
0
20 Nov 2024
SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering
  in LLMs
SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs
Ruben Härle
Felix Friedrich
Manuel Brack
Bjorn Deiseroth
P. Schramowski
Kristian Kersting
23
0
0
11 Nov 2024
Beyond Toxic Neurons: A Mechanistic Analysis of DPO for Toxicity
  Reduction
Beyond Toxic Neurons: A Mechanistic Analysis of DPO for Toxicity Reduction
Yushi Yang
Filip Sondej
Harry Mayne
Adam Mahdi
24
0
0
10 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla
Suraj Srinivas
Asma Ghandeharioun
Himabindu Lakkaraju
38
5
0
07 Nov 2024
Improving Steering Vectors by Targeting Sparse Autoencoder Features
Improving Steering Vectors by Targeting Sparse Autoencoder Features
Sviatoslav Chalnev
Matthew Siu
Arthur Conmy
LLMSV
47
16
0
04 Nov 2024
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting
  Rare Concepts in Foundation Models
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
Aashiq Muhamed
Mona Diab
Virginia Smith
38
2
0
01 Nov 2024
Efficient Training of Sparse Autoencoders for Large Language Models via
  Layer Groups
Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
Davide Ghilardi
Federico Belotti
Marco Molinari
21
2
0
28 Oct 2024
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse
  Autoencoders
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
Viacheslav Surkov
Chris Wendler
Mikhail Terekhov
Justin Deschenaux
Robert West
Çağlar Gülçehre
VLM
40
13
0
28 Oct 2024
Beyond Interpretability: The Gains of Feature Monosemanticity on Model
  Robustness
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness
Qi Zhang
Yifei Wang
Jingyi Cui
Xiang Pan
Qi Lei
Stefanie Jegelka
Yisen Wang
AAML
34
1
0
27 Oct 2024
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with
  Sparse Autoencoders
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
Zhengfu He
Wentao Shu
Xuyang Ge
Lingjie Chen
Junxuan Wang
...
Qipeng Guo
Xuanjing Huang
Zuxuan Wu
Yu Jiang
Xipeng Qiu
21
13
0
27 Oct 2024
Applying sparse autoencoders to unlearn knowledge in language models
Applying sparse autoencoders to unlearn knowledge in language models
Eoin Farrell
Yeu-Tong Lau
Arthur Conmy
MU
29
14
0
25 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders
Decomposing The Dark Matter of Sparse Autoencoders
Joshua Engels
Logan Riggs
Max Tegmark
LLMSV
55
9
0
18 Oct 2024
The Persian Rug: solving toy models of superposition using large-scale
  symmetries
The Persian Rug: solving toy models of superposition using large-scale symmetries
Aditya Cowsik
Kfir Dolev
Alex Infanger
19
0
0
15 Oct 2024
Analyzing (In)Abilities of SAEs via Formal Languages
Analyzing (In)Abilities of SAEs via Formal Languages
Abhinav Menon
Manish Shrivastava
David M. Krueger
Ekdeep Singh Lubana
42
7
0
15 Oct 2024
12
Next