Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation


4 October 2024
Xinpeng Wang
Chengzhi Hu
Paul Röttger
Barbara Plank

Papers citing "Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation"

4 / 4 papers shown
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey
David E. Evans
LLMSV
67
0
0
23 Apr 2025
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao
Jiahe Guo
Yulin Hu
Yang Deng
An Zhang
...
Xinyang Han
Yanyan Zhao
Bing Qin
Tat-Seng Chua
Ting Liu
AAML
LLMSV
37
0
0
13 Apr 2025
Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior
S.
Xinpeng Wang
Guangyao Zhai
Nassir Navab
Barbara Plank
LLMAG
51
0
0
22 Mar 2025
Steering Language Model Refusal with Sparse Autoencoders
Kyle O'Brien
David Majercak
Xavier Fernandes
Richard Edgar
Jingya Chen
Harsha Nori
Dean Carignan
Eric Horvitz
Forough Poursabzi-Sangdeh
LLMSV
52
9
0
18 Nov 2024