SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

22 May 2025

Papers citing "SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models"

SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

105

25 Nov 2025

SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

Wei Xia

Zhi-Hong Deng

ALM

266

20 Nov 2025

Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement

176

28 Sep 2025

Evaluating Sparse Autoencoders for Monosemantic Representation

Moghis Fereidouni

Muhammad Umair Haider

Peizhong Ju

A.B. Siddique

136

20 Aug 2025

Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder

394

25 Jun 2025

Improving LLM Reasoning through Interpretable Role-Playing Steering

219

09 Jun 2025

FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

317

20 Apr 2025

Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

430

24 Feb 2025

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Subhash Kantamneni

Joshua Engels

Senthooran Rajamanoharan

Max Tegmark

Neel Nanda

348

23 Feb 2025

Sparse Autoencoder Features for Classifications and Transferability

Danielle S. Bitterman

283

17 Feb 2025

SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

298

17 Feb 2025

A Unified Understanding and Evaluation of Steering Methods

Shawn Im

Yixuan Li

LLMSV

268

04 Feb 2025

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

International Conference on Learning Representations (ICLR), 2024

Javier Ferrando

Oscar Obeso

Senthooran Rajamanoharan

Neel Nanda

484

21 Nov 2024

Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

International Conference on Learning Representations (ICLR), 2024

418

30 Sep 2024

Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories

424

26 May 2024