Analyzing the Generalization and Reliability of Steering Vectors

17 July 2024

Dimitrios Kanoulas

Adrià Garriga-Alonso

Papers citing "Analyzing the Generalization and Reliability of Steering Vectors"

Title | Authors | Tags | Counts | Date
Patterns and Mechanisms of Contrastive Activation Engineering | Yixiong Hao, Ayush Panda, Stepan Shabalin, Sheikh Abdur Raheem Ali | LLMSV | 58, 0, 0 | 06 May 2025
On the Limitations of Steering in Language Model Alignment | Chebrolu Niranjan, Kokil Jaidka, G. Yeo | LLMSV | 31, 0, 0 | 02 May 2025
Programming Refusal with Conditional Activation Steering | Bruce W. Lee, Inkit Padhi, K. Ramamurthy, Erik Miehling, Pierre L. Dognin, Manish Nagireddy, Amit Dhurandhar | LLMSV | 87, 13, 0 | 06 Sep 2024
The Platonic Representation Hypothesis | Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola | | 72, 107, 0 | 13 May 2024
Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models | Goutham Rajendran, Simon Buchholz, Bryon Aragam, Bernhard Schölkopf, Pradeep Ravikumar | AI4CE | 67, 19, 0 | 14 Feb 2024
Universal Neurons in GPT2 Language Models | Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas | MILM | 83, 37, 0 | 22 Jan 2024
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | Samuel Marks, Max Tegmark | HILM | 91, 164, 0 | 10 Oct 2023
Understanding the Effects of RLHF on LLM Generalisation and Diversity | Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, Roberta Raileanu | AI4CE, ALM | 95, 63, 0 | 10 Oct 2023
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | Jason W. Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, F. Xia, Ed H. Chi, Quoc Le, Denny Zhou | LM&Ro, LRM, AI4CE, ReLM | 313, 8,261, 0 | 28 Jan 2022
Fine-Tuning Language Models from Human Preferences | Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving | ALM | 273, 1,561, 0 | 18 Sep 2019