Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs

22 May 2025

Mostafa Elhoushi

Papers citing "Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs"

Title | Authors | Tags | Counts | Date
The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features | Jeremias Lino Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo | LLMSV | 60 0 0 | 16 Sep 2025
MASteer: Multi-Agent Adaptive Steer Strategy for End-to-End LLM Trustworthiness Repair | Changqing Li, Tianlin Li, Xiaohan Zhang, Aishan Liu, Li Pan | KELM, LLMSV | 48 0 0 | 09 Aug 2025
FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering | Yongbin Li, Zhiting Fan, Ruizhe Chen, Xiaotang Gai, Luqi Gong, Yan Zhang, Zuozhu Liu | LLMSV | 161 8 0 | 20 Apr 2025
Improving Instruction-Following in Language Models through Activation Steering | Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi | LLMSV | 259 48 0 | 15 Oct 2024
Robust LLM safeguarding via refusal feature adversarial training | L. Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda | AAML | 213 33 0 | 30 Sep 2024
Programming Refusal with Conditional Activation Steering | Bruce W. Lee, Inkit Padhi, Karthikeyan N. Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar | LLMSV | 265 52 0 | 06 Sep 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories | Tianlong Wang, Xianfeng Jiao, Yifan He, Zhongzhi Chen, Yinghao Zhu, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma | LLMSV | 205 33 0 | 26 May 2024
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators | Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori Hashimoto | ALM | 254 502 0 | 06 Apr 2024