2410.12877
v1
v2 (latest)
Improving Instruction-Following in Language Models through Activation Steering
15 October 2024
Alessandro Stolfo
Vidhisha Balachandran
Safoora Yousefi
Eric Horvitz
Besmira Nushi
LLMSV
Papers citing
"Improving Instruction-Following in Language Models through Activation Steering"
48 / 98 papers shown
Title
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee
Xiaoyan Bai
Itamar Pres
Martin Wattenberg
Jonathan K. Kummerfeld
Rada Mihalcea
194
146
0
03 Jan 2024
Steering Llama 2 via Contrastive Activation Addition
Nina Rimsky
Nick Gabrieli
Julian Schulz
Meg Tong
Evan Hubinger
Alexander Matt Turner
LLMSV
158
341
0
09 Dec 2023
Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou
Tianjian Lu
Swaroop Mishra
Siddhartha Brahma
Sujoy Basu
Yi Luan
Denny Zhou
Le Hou
ELM
ALM
LRM
148
430
0
14 Nov 2023
In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
Sheng Liu
Haotian Ye
Lei Xing
James Y. Zou
186
162
0
11 Nov 2023
The Linear Representation Hypothesis and the Geometry of Large Language Models
Kiho Park
Yo Joong Choe
Victor Veitch
LLMSV
MILM
265
257
0
07 Nov 2023
In-Context Learning Creates Task Vectors
Roee Hendel
Mor Geva
Amir Globerson
166
206
0
24 Oct 2023
Function Vectors in Large Language Models
Eric Todd
Millicent Li
Arnab Sen Sharma
Aaron Mueller
Byron C. Wallace
David Bau
115
154
0
23 Oct 2023
Evaluating Large Language Models on Controlled Generation Tasks
Jiao Sun
Yufei Tian
Wangchunshu Zhou
Nan Xu
Qian Hu
Rahul Gupta
John Wieting
Nanyun Peng
Xuezhe Ma
LRM
ELM
146
75
0
23 Oct 2023
Towards Understanding Sycophancy in Language Models
Mrinank Sharma
Meg Tong
Tomasz Korbak
David Duvenaud
Amanda Askell
...
Oliver Rausch
Nicholas Schiefer
Da Yan
Miranda Zhang
Ethan Perez
558
343
0
20 Oct 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks
Max Tegmark
HILM
268
289
0
10 Oct 2023
Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages
Shih-Cheng Huang
Pin-Zu Li
Yu-Chi Hsu
Kuang-Ming Chen
Yu Tung Lin
Shih-Kai Hsiao
Richard Tzong-Han Tsai
Hung-yi Lee
MoMe
148
25
0
07 Oct 2023
Language Models Represent Space and Time
Wes Gurnee
Max Tegmark
211
203
0
03 Oct 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
270
621
0
15 Sep 2023
Large Language Models as Optimizers
Chengrun Yang
Xuezhi Wang
Yifeng Lu
Hanxiao Liu
Quoc V. Le
Denny Zhou
Xinyun Chen
ODL
225
527
0
07 Sep 2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models
Neel Nanda
Andrew Lee
Martin Wattenberg
FAtt
MILM
196
223
0
02 Sep 2023
A Geometric Notion of Causal Probing
Clément Guerner
Anej Svete
Tianyu Liu
Alex Warstadt
Robert Bamler
LLMSV
203
20
0
27 Jul 2023
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Kenneth Li
Oam Patel
Fernanda Viégas
Hanspeter Pfister
Martin Wattenberg
KELM
HILM
305
696
0
06 Jun 2023
LEACE: Perfect linear concept erasure in closed form
Nora Belrose
David Schneider-Joseph
Shauli Ravfogel
Robert Bamler
Edward Raff
Stella Biderman
KELM
MU
317
143
0
06 Jun 2023
Controlled Text Generation with Natural Language Instructions
Wangchunshu Zhou
Yuchen Eleanor Jiang
Ethan Gotlieb Wilcox
Robert Bamler
Mrinmaya Sachan
312
104
0
27 Apr 2023
The Internal State of an LLM Knows When It's Lying
A. Azaria
Tom Michael Mitchell
HILM
450
422
0
26 Apr 2023
OpenAssistant Conversations -- Democratizing Large Language Model Alignment
Andreas Kopf
Yannic Kilcher
Dimitri von Rutte
Sotiris Anagnostidis
Zhi Rui Tam
...
Arnav Dantuluri
Andrew Maguire
Christoph Schuhmann
Huu Nguyen
A. Mattick
ALM
LM&MA
232
711
0
14 Apr 2023
Inspecting and Editing Knowledge Representations in Language Models
Evan Hernandez
Belinda Z. Li
Jacob Andreas
KELM
168
108
0
03 Apr 2023
Large Language Model Instruction Following: A Survey of Progresses and Challenges
Renze Lou
Kai Zhang
Wenpeng Yin
ALM
LRM
367
33
0
18 Mar 2023
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
Shayne Longpre
Le Hou
Tu Vu
Albert Webson
Hyung Won Chung
...
Denny Zhou
Quoc V. Le
Barret Zoph
Jason W. Wei
Adam Roberts
ALM
229
765
0
31 Jan 2023
Discovering Language Model Behaviors with Model-Written Evaluations
Ethan Perez
Sam Ringer
Kamilė Lukošiūtė
Karina Nguyen
Edwin Chen
...
Danny Hernandez
Deep Ganguli
Evan Hubinger
Nicholas Schiefer
Jared Kaplan
ALM
230
494
0
19 Dec 2022
Editing Models with Task Arithmetic
Gabriel Ilharco
Marco Tulio Ribeiro
Mitchell Wortsman
Suchin Gururangan
Ludwig Schmidt
Hannaneh Hajishirzi
Ali Farhadi
KELM
MoMe
MU
493
627
0
08 Dec 2022
Discovering Latent Knowledge in Language Models Without Supervision
Collin Burns
Haotian Ye
Dan Klein
Jacob Steinhardt
296
464
0
07 Dec 2022
Scaling Instruction-Finetuned Language Models
Hyung Won Chung
Le Hou
Shayne Longpre
Barret Zoph
Yi Tay
...
Jacob Devlin
Adam Roberts
Denny Zhou
Quoc V. Le
Jason W. Wei
ReLM
LRM
671
3,487
0
20 Oct 2022
InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning
Prakhar Gupta
Cathy Jiao
Yi-Ting Yeh
Shikib Mehri
M. Eskénazi
Jeffrey P. Bigham
ALM
176
52
0
25 May 2022
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima
S. Gu
Machel Reid
Yutaka Matsuo
Yusuke Iwasawa
ReLM
LRM
996
5,247
0
24 May 2022
Extracting Latent Steering Vectors from Pretrained Language Models
Nishant Subramani
Nivedita Suresh
Matthew E. Peters
LLMSV
126
119
0
10 May 2022
What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment
Matthew Finlayson
Kyle Richardson
Ashish Sabharwal
Peter Clark
125
13
0
19 Apr 2022
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
Yizhong Wang
Swaroop Mishra
Pegah Alipoormolabashi
Yeganeh Kordi
Amirreza Mirzaei
...
Chitta Baral
Yejin Choi
Noah A. Smith
Hannaneh Hajishirzi
Daniel Khashabi
ELM
315
939
0
16 Apr 2022
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
Mor Geva
Avi Caciularu
Ke Wang
Yoav Goldberg
KELM
294
426
0
28 Mar 2022
Gold Doesn't Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information
Shun Shao
Yftah Ziser
Shay B. Cohen
AAML
139
32
0
15 Mar 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
1.6K
15,250
0
04 Mar 2022
A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models
Hanqing Zhang
Haolin Song
Shaoyu Li
Ming Zhou
Dawei Song
266
262
0
14 Jan 2022
A General Language Assistant as a Laboratory for Alignment
Amanda Askell
Yuntao Bai
Anna Chen
Dawn Drain
Deep Ganguli
...
Tom B. Brown
Jack Clark
Sam McCandlish
C. Olah
Jared Kaplan
ALM
191
866
0
01 Dec 2021
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
500
1,802
0
15 Oct 2021
Finetuned Language Models Are Zero-Shot Learners
Jason W. Wei
Maarten Bosma
Vincent Zhao
Kelvin Guu
Adams Wei Yu
Brian Lester
Nan Du
Andrew M. Dai
Quoc V. Le
ALM
UQCV
636
4,172
0
03 Sep 2021
CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP
Qinyuan Ye
Bill Yuchen Lin
Xiang Ren
373
189
0
18 Apr 2021
Cross-Task Generalization via Natural Language Crowdsourcing Instructions
Swaroop Mishra
Daniel Khashabi
Chitta Baral
Hannaneh Hajishirzi
LRM
326
804
0
18 Apr 2021
Array Programming with NumPy
Charles R. Harris
K. Millman
S. Walt
R. Gommers
Pauli Virtanen
...
Tyler Reddy
Warren Weckesser
Hameer Abbasi
C. Gohlke
T. Oliphant
321
16,484
0
18 Jun 2020
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection
Shauli Ravfogel
Yanai Elazar
Hila Gonen
Michael Twiton
Yoav Goldberg
271
423
0
16 Apr 2020
Plug and Play Language Models: A Simple Approach to Controlled Text Generation
Sumanth Dathathri
Andrea Madotto
Janice Lan
Jane Hung
Eric Frank
Piero Molino
J. Yosinski
Rosanne Liu
KELM
230
1,034
0
04 Dec 2019
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke
Sam Gross
Francisco Massa
Adam Lerer
James Bradbury
...
Sasank Chilamkurthy
Benoit Steiner
Lu Fang
Junjie Bai
Soumith Chintala
ODL
808
45,179
0
03 Dec 2019
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
Tolga Bolukbasi
Kai-Wei Chang
James Zou
Venkatesh Saligrama
Adam Kalai
CVBM
FaML
255
3,305
0
21 Jul 2016
Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov
Ilya Sutskever
Kai Chen
G. Corrado
J. Dean
NAI
OCL
674
33,976
0
16 Oct 2013