2410.03415
Cited By
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
4 October 2024
Xinpeng Wang
Chengzhi Hu
Paul Röttger
Barbara Plank
Papers citing
"Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation"
4 / 4 papers shown
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey
David E. Evans
LLMSV
67
0
0
23 Apr 2025
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao
Jiahe Guo
Yulin Hu
Yang Deng
An Zhang
...
Xinyang Han
Yanyan Zhao
Bing Qin
Tat-Seng Chua
Ting Liu
AAML
LLMSV
37
0
0
13 Apr 2025
Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior
S.
Xinpeng Wang
Guangyao Zhai
Nassir Navab
Barbara Plank
LLMAG
51
0
0
22 Mar 2025
Steering Language Model Refusal with Sparse Autoencoders
Kyle O'Brien
David Majercak
Xavier Fernandes
Richard Edgar
Jingya Chen
Harsha Nori
Dean Carignan
Eric Horvitz
Forough Poursabzi-Sangdeh
LLMSV
52
9
0
18 Nov 2024