2402.00402
Cited By
Investigating Bias Representations in Llama 2 Chat via Activation Steering
1 February 2024
Dawn Lu
Nina Rimsky
LLMSV
Papers citing
"Investigating Bias Representations in Llama 2 Chat via Activation Steering"
9 / 9 papers shown
Title
Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions
Yoonah Park
Haesung Pyun
Yohan Jo
KELM
32
0
0
28 Sep 2025
Improving Multilingual Language Models by Aligning Representations through Steering
Omar Mahmoud
B. L. Semage
Thommen George Karimpanal
Santu Rana
LLMSV
195
2
0
19 May 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
248
2
0
24 Feb 2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che
Stephen Casper
Robert Kirk
Anirudh Satheesh
Stewart Slocum
...
Zikui Cai
Bilal Chughtai
Y. Gal
Furong Huang
Dylan Hadfield-Menell
MU
AAML
ELM
341
16
0
03 Feb 2025
Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo
Vidhisha Balachandran
Safoora Yousefi
Eric Horvitz
Besmira Nushi
LLMSV
259
48
0
15 Oct 2024
Programming Refusal with Conditional Activation Steering
Bruce W. Lee
Inkit Padhi
Karthikeyan N. Ramamurthy
Erik Miehling
Pierre Dognin
Manish Nagireddy
Amit Dhurandhar
LLMSV
265
52
0
06 Sep 2024
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland
Alexander Lyzhov
Jacob Pfau
Salsabila Mahdi
Samuel R. Bowman
LLMSV
AAML
156
31
0
21 Jun 2024
Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Stephen Casper
Lennart Schulze
Oam Patel
Dylan Hadfield-Menell
AAML
299
49
0
08 Mar 2024
Eight Methods to Evaluate Robust Unlearning in LLMs
Aengus Lynch
Phillip Guo
Aidan Ewart
Stephen Casper
Dylan Hadfield-Menell
ELM
MU
202
101
0
26 Feb 2024