2406.15518
Cited By
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
21 June 2024
Asa Cooper Stickland
Alexander Lyzhov
Jacob Pfau
Salsabila Mahdi
Samuel R. Bowman
LLMSV
AAML
Papers citing
"Steering Without Side Effects: Improving Post-Deployment Control of Language Models"
6 / 6 papers shown
Title
Evaluating the Prompt Steerability of Large Language Models
Erik Miehling
Michael Desmond
K. Ramamurthy
Elizabeth M. Daly
Pierre L. Dognin
Jesus Rios
Djallel Bouneffouf
Miao Liu
LLMSV
85
3
0
19 Nov 2024
Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo
Vidhisha Balachandran
Safoora Yousefi
Eric Horvitz
Besmira Nushi
LLMSV
40
13
0
15 Oct 2024
Robust LLM safeguarding via refusal feature adversarial training
L. Yu
Virginie Do
Karen Hambardzumyan
Nicola Cancedda
AAML
42
9
0
30 Sep 2024
Representation Tuning
Christopher M. Ackerman
LLMSV
19
0
0
11 Sep 2024
Programming Refusal with Conditional Activation Steering
Bruce W. Lee
Inkit Padhi
K. Ramamurthy
Erik Miehling
Pierre L. Dognin
Manish Nagireddy
Amit Dhurandhar
LLMSV
87
13
0
06 Sep 2024
Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
James Chua
Edward Rees
Hunar Batra
Samuel R. Bowman
Julian Michael
Ethan Perez
Miles Turpin
LRM
30
13
0
08 Mar 2024