ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2402.00402
  4. Cited By
Investigating Bias Representations in Llama 2 Chat via Activation
  Steering

Investigating Bias Representations in Llama 2 Chat via Activation Steering

1 February 2024
Dawn Lu
Nina Rimsky
    LLMSV
ArXiv (abs)PDFHTML

Papers citing "Investigating Bias Representations in Llama 2 Chat via Activation Steering"

9 / 9 papers shown
Title
Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions
Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions
Yoonah Park
Haesung Pyun
Yohan Jo
KELM
32
0
0
28 Sep 2025
Improving Multilingual Language Models by Aligning Representations through Steering
Improving Multilingual Language Models by Aligning Representations through Steering
Omar Mahmoud
B. L. Semage
Thommen George Karimpanal
Santu Rana
LLMSV
195
2
0
19 May 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
248
2
0
24 Feb 2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che
Stephen Casper
Robert Kirk
Anirudh Satheesh
Stewart Slocum
...
Zikui Cai
Bilal Chughtai
Y. Gal
Furong Huang
Dylan Hadfield-Menell
MUAAMLELM
341
16
0
03 Feb 2025
Improving Instruction-Following in Language Models through Activation Steering
Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo
Vidhisha Balachandran
Safoora Yousefi
Eric Horvitz
Besmira Nushi
LLMSV
259
48
0
15 Oct 2024
Programming Refusal with Conditional Activation Steering
Programming Refusal with Conditional Activation Steering
Bruce W. Lee
Inkit Padhi
Karthikeyan N. Ramamurthy
Erik Miehling
Pierre Dognin
Manish Nagireddy
Amit Dhurandhar
LLMSV
265
52
0
06 Sep 2024
Steering Without Side Effects: Improving Post-Deployment Control of
  Language Models
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland
Alexander Lyzhov
Jacob Pfau
Salsabila Mahdi
Samuel R. Bowman
LLMSVAAML
156
31
0
21 Jun 2024
Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Stephen Casper
Lennart Schulze
Oam Patel
Dylan Hadfield-Menell
AAML
299
49
0
08 Mar 2024
Eight Methods to Evaluate Robust Unlearning in LLMs
Eight Methods to Evaluate Robust Unlearning in LLMs
Aengus Lynch
Phillip Guo
Aidan Ewart
Stephen Casper
Dylan Hadfield-Menell
ELMMU
202
101
0
26 Feb 2024
1