Multi-property Steering of Large Language Models with Dynamic Activation Composition

25 June 2024

Papers citing "Multi-property Steering of Large Language Models with Dynamic Activation Composition"

Title | Authors | Topic tags | Metrics | Date
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control | Hannah Cyberey, David E. Evans | LLMSV | 74 0 0 | 23 Apr 2025
Activation Steering in Neural Theorem Provers | Shashank Kirtania | LLMSV | 126 0 0 | 21 Feb 2025
Improving Instruction-Following in Language Models through Activation Steering | Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi | LLMSV | 52 14 0 | 15 Oct 2024
Programming Refusal with Conditional Activation Steering | Bruce W. Lee, Inkit Padhi, K. Ramamurthy, Erik Miehling, Pierre L. Dognin, Manish Nagireddy, Amit Dhurandhar | LLMSV | 91 13 0 | 06 Sep 2024
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | Samuel Marks, Max Tegmark | HILM | 91 168 0 | 10 Oct 2023
Understanding the Effects of RLHF on LLM Generalisation and Diversity | Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, Roberta Raileanu | AI4CE, ALM | 97 121 0 | 10 Oct 2023
RAMP: Retrieval and Attribute-Marking Enhanced Prompting for Attribute-Controlled Translation | Gabriele Sarti, Phu Mon Htut, Xing Niu, B. Hsu, Anna Currey, Georgiana Dinu, Maria Nadejde | LRM | 37 9 0 | 26 May 2023
Prompt-and-Rerank: A Method for Zero-Shot and Few-Shot Arbitrary Textual Style Transfer with Small Language Models | Mirac Suzgun, Luke Melas-Kyriazi, Dan Jurafsky | VLM | 77 65 0 | 23 May 2022
Training language models to follow instructions with human feedback | Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe | OSLM, ALM | 306 11,909 0 | 04 Mar 2022
ETC-NLG: End-to-end Topic-Conditioned Natural Language Generation | Ginevra Carbone, Gabriele Sarti | | 24 9 0 | 25 Aug 2020
Fine-Tuning Language Models from Human Preferences | Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving | ALM | 275 1,587 0 | 18 Sep 2019