Investigating Bias Representations in Llama 2 Chat via Activation Steering

1 February 2024

Papers citing "Investigating Bias Representations in Llama 2 Chat via Activation Steering"

Title | Authors | Tags | Counts | Date
Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions | Yoonah Park, Haesung Pyun, Yohan Jo | KELM | 32 0 0 | 28 Sep 2025
Improving Multilingual Language Models by Aligning Representations through Steering | Omar Mahmoud, B. L. Semage, Thommen George Karimpanal, Santu Rana | LLMSV | 195 2 0 | 19 May 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges | Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple | | 248 2 0 | 24 Feb 2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, ..., Zikui Cai, Bilal Chughtai, Y. Gal, Furong Huang, Dylan Hadfield-Menell | MU, AAML, ELM | 341 16 0 | 03 Feb 2025
Improving Instruction-Following in Language Models through Activation Steering | Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi | LLMSV | 259 48 0 | 15 Oct 2024
Programming Refusal with Conditional Activation Steering | Bruce W. Lee, Inkit Padhi, Karthikeyan N. Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar | LLMSV | 265 52 0 | 06 Sep 2024
Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman | LLMSV, AAML | 156 31 0 | 21 Jun 2024
Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell | AAML | 299 49 0 | 08 Mar 2024
Eight Methods to Evaluate Robust Unlearning in LLMs | Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell | ELM, MU | 202 101 0 | 26 Feb 2024