Steering Without Side Effects: Improving Post-Deployment Control of Language Models

21 June 2024

Papers citing "Steering Without Side Effects: Improving Post-Deployment Control of Language Models"

Title
Evaluating the Prompt Steerability of Large Language Models | Erik Miehling, Michael Desmond, K. Ramamurthy, Elizabeth M. Daly, Pierre L. Dognin, Jesus Rios, Djallel Bouneffouf, Miao Liu | LLMSV | 85 3 0 | 19 Nov 2024
Improving Instruction-Following in Language Models through Activation Steering | Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi | LLMSV | 40 13 0 | 15 Oct 2024
Robust LLM safeguarding via refusal feature adversarial training | L. Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda | AAML | 42 9 0 | 30 Sep 2024
Representation Tuning | Christopher M. Ackerman | LLMSV | 19 0 0 | 11 Sep 2024
Programming Refusal with Conditional Activation Steering | Bruce W. Lee, Inkit Padhi, K. Ramamurthy, Erik Miehling, Pierre L. Dognin, Manish Nagireddy, Amit Dhurandhar | LLMSV | 87 13 0 | 06 Sep 2024
Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought | James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, Miles Turpin | LRM | 30 13 0 | 08 Mar 2024