FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

20 April 2025

Papers citing "FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering"

Silenced Biases: The Dark Side LLMs Learned to Refuse Rom Himelstein Amit Levi Brit Youngmann Yaniv Nemcovsky A. Mendelson 74 1 0 05 Nov 2025
Can SAEs reveal and mitigate racial biases of LLMs in healthcare? Hiba Ahsan Byron C. Wallace LLMSV 121 0 0 31 Oct 2025
Angular Steering: Behavior Control via Rotation in Activation Space Hieu M. Vu T. Nguyen LLMSV 288 3 0 30 Oct 2025
Robust Preference Alignment via Directional Neighborhood Consensus Ruochen Mao Yuling Shi Xiaodong Gu Jiaheng Wei 135 0 0 23 Oct 2025
Debiasing LLMs by Masking Unfairness-Driving Attention Heads Tingxu Han Wei Song Ziqi Ding Z. Li Chunrong Fang Yuekang Li Dongfang Liu Zhenyu Chen Zhenting Wang 151 0 0 11 Oct 2025
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses Xin Xu Xunzhi He Churan Zhi Ruizhe Chen Julian McAuley Zexue He 50 0 0 30 Sep 2025
BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them Sekh Mainul Islam Nadav Borenstein Siddhesh Pawar Haeun Yu Arnav Arora Isabelle Augenstein 154 1 0 12 Aug 2025
BiasFilter: An Inference-Time Debiasing Framework for Large Language Models Xiaoqing Cheng Ruizhe Chen Hongying Zan Yuxiang Jia Min Peng 247 1 0 28 May 2025
SAEs Are Good for Steering -- If You Select the Right Features Dana Arad Aaron Mueller Yonatan Belinkov LLMSV 179 19 0 26 May 2025
SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models Zirui He Haoyang Ling Bo Shen Ali Payani Zelong Li Mengnan Du LLMSV 348 7 0 22 May 2025
Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs Amr Hegazy Mostafa Elhoushi Amr Alanwar LLMSV 265 2 0 22 May 2025
BiasGuard: A Reasoning-enhanced Bias Detection Tool For Large Language Models Annual Meeting of the Association for Computational Linguistics (ACL), 2025 Zhiting Fan Ruizhe Chen Zuozhu Liu 333 4 0 30 Apr 2025