Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2406.00034
Cited By
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
26 May 2024
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories"
10 / 10 papers shown
Title
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao
Jiahe Guo
Yulin Hu
Yang Deng
An Zhang
...
Xinyang Han
Yanyan Zhao
Bing Qin
Tat-Seng Chua
Ting Liu
AAML
LLMSV
37
0
0
13 Apr 2025
Effectively Steer LLM To Follow Preference via Building Confident Directions
Bingqing Song
Boran Han
Shuai Zhang
Hao Wang
Haoyang Fang
Bonan Min
Yuyang Wang
Mingyi Hong
LLMSV
42
0
0
04 Mar 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
98
0
0
24 Feb 2025
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao
Heng Zhao
Bo Shen
Ali Payani
Fan Yang
Mengnan Du
55
2
0
30 Sep 2024
Programming Refusal with Conditional Activation Steering
Bruce W. Lee
Inkit Padhi
K. Ramamurthy
Erik Miehling
Pierre L. Dognin
Manish Nagireddy
Amit Dhurandhar
LLMSV
87
13
0
06 Sep 2024
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
117
314
0
21 Sep 2022
Teaching language models to support answers with verified quotes
Jacob Menick
Maja Trebacz
Vladimir Mikulik
John Aslanides
Francis Song
...
Mia Glaese
Susannah Young
Lucy Campbell-Gillingham
G. Irving
Nat McAleese
ELM
RALM
213
204
0
21 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
313
8,261
0
28 Jan 2022
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
219
291
0
24 Feb 2021
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
273
1,561
0
18 Sep 2019
1