Improving Activation Steering in Language Models with Mean-Centring

6 December 2023

Papers citing "Improving Activation Steering in Language Models with Mean-Centring"

To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models

286

15 Oct 2025

BILLY: Steering Large Language Models via Merging Persona Vectors for Creative Generation

278

11 Oct 2025

Multimodal Function Vectors for Spatial Relations

134

02 Oct 2025

Who is In Charge? Dissecting Role Conflicts in Instruction Following

Siqi Zeng

176

23 Sep 2025

ReCoVeR the Target Language: Language Steering without Sacrificing Task Performance

203

18 Sep 2025

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM

310

07 Aug 2025

LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers

217

06 Jul 2025

From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers

Jingtong Su

Julia Kempe

Karen Ullrich

406

20 Jun 2025

Probing the Robustness of Large Language Models Safety to Latent Perturbations

326

19 Jun 2025

Linear Spatial World Models Emerge in Large Language Models

240

03 Jun 2025

IF-GUIDE: Influence Function-Guided Detoxification of LLMs

513

02 Jun 2025

SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

610

22 May 2025

Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

375

21 May 2025

Risk Assessment Framework for Code LLMs via Leveraging Internal States

279

20 Apr 2025

Representation Bending for Large Language Model Safety

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

479

02 Apr 2025

Inference-Time Intervention in Large Language Models for Reliable Requirement Verification

Paul Darm

James Xie

A. Riccardi

232

18 Mar 2025

Towards Understanding Distilled Reasoning Models: A Representational Approach

David D. Baek

Max Tegmark

LRM

400

05 Mar 2025

SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

378

17 Feb 2025

Designing Role Vectors to Improve LLM Inference Behaviour

330

17 Feb 2025

Enhancing Semantic Consistency of Large Language Models through Model Editing: An Interpretability-Oriented Approach

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

358

19 Jan 2025

Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering

Joris Postmus

Steven Abreu

LLMSV

826

09 Oct 2024

Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

International Conference on Learning Representations (ICLR), 2024

517

30 Sep 2024

Programming Refusal with Conditional Activation Steering

International Conference on Learning Representations (ICLR), 2024

Bruce W. Lee

Inkit Padhi

Karthikeyan N. Ramamurthy

569

106

06 Sep 2024

Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories

531

26 May 2024

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Stephen Casper

Lennart Schulze

Oam Patel

Dylan Hadfield-Menell

AAML

817

08 Mar 2024

Tradeoffs Between Alignment and Helpfulness in Language Models with Steering Methods

810

29 Jan 2024

LEACE: Perfect linear concept erasure in closed form

Neural Information Processing Systems (NeurIPS), 2023

Nora Belrose

David Schneider-Joseph

937

189

06 Jun 2023