SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

22 May 2025

Papers citing "SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models"

15 / 15 papers shown

SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

106

25 Nov 2025

SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

Wei Xia

Zhi-Hong Deng

ALM

271

20 Nov 2025

Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement

179

28 Sep 2025

Evaluating Sparse Autoencoders for Monosemantic Representation

Moghis Fereidouni

Muhammad Umair Haider

Peizhong Ju

A.B. Siddique

137

20 Aug 2025

Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder

398

25 Jun 2025

Improving LLM Reasoning through Interpretable Role-Playing Steering

229

09 Jun 2025

FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

320

20 Apr 2025

Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

430

24 Feb 2025

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Subhash Kantamneni

Joshua Engels

Senthooran Rajamanoharan

Max Tegmark

Neel Nanda

356

23 Feb 2025

Sparse Autoencoder Features for Classifications and Transferability

Danielle S. Bitterman

284

17 Feb 2025

SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

305

17 Feb 2025

A Unified Understanding and Evaluation of Steering Methods

Shawn Im

Yixuan Li

LLMSV

277

04 Feb 2025

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

International Conference on Learning Representations (ICLR), 2024

Javier Ferrando

Oscar Obeso

Senthooran Rajamanoharan

Neel Nanda

498

21 Nov 2024

Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

International Conference on Learning Representations (ICLR), 2024

428

30 Sep 2024

Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories

436

26 May 2024