Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
arXiv:2407.09121 (v2, latest)

12 July 2024
Youliang Yuan
Wenxiang Jiao
Wenxuan Wang
Shu Yang
Jiahao Xu
Tian Liang
Pinjia He
Zhaopeng Tu
ArXiv (abs) · PDF · HTML · HuggingFace (6 upvotes) · GitHub (72★)

Papers citing "Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training"

50 / 84 papers shown
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
Junbo Zhang
Ran Chen
Qianli Zhou
Xinyang Deng
Wen Jiang
232
3
0
24 Nov 2025
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
Zihao Yi
Qingxuan Jiang
Ruotian Ma
Xingyu Chen
Qu Yang
...
Fanghua Ye
Ying Shen
Zhaopeng Tu
Xiaolong Li
Linus
279
4
0
07 Nov 2025
Read the Scene, Not the Script: Outcome-Aware Safety for LLMs
Rui Wu
Yihao Quan
Zeru Shi
Zhenting Wang
Yanshu Li
Ruixiang Tang
177
1
0
05 Oct 2025
How Catastrophic is Your LLM? Certifying Risk in Conversation
Chengxiao Wang
Isha Chaudhary
Qian Hu
Weitong Ruan
Rahul Gupta
Gagandeep Singh
199
1
0
04 Oct 2025
BiasGym: A Simple and Generalizable Framework for Analyzing and Removing Biases through Elicitation
Sekh Mainul Islam
Nadav Borenstein
Siddhesh Pawar
Haeun Yu
Arnav Arora
Isabelle Augenstein
289
0
0
12 Aug 2025
FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
Christina Q. Knight
Kaustubh Deshpande
Ved Sirdeshmukh
Meher Mankikar
Scale Red Team
SEAL Research Team
Julian Michael
AAML, ELM
380
8
0
17 Jun 2025
Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs
Hiroshi Matsuda
Chunpeng Ma
Masayuki Asahara
402
9
0
11 Jun 2025
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Yang Li
Qiang Sheng
Yehan Yang
Xueyao Zhang
Juan Cao
432
12
0
11 Jun 2025
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models
Youze Wang
Wenbo Hu
Yinpeng Dong
Jing Liu
Hanwang Zhang
Richang Hong
378
16
0
02 Jun 2025
A Red Teaming Roadmap Towards System-Level Safety
Zifan Wang
Christina Q. Knight
Jeremy Kritz
Willow Primack
Julian Michael
AAML
384
2
0
30 May 2025
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities
Sahil Verma
Keegan E. Hines
J. Bilmes
Charlotte Siska
Luke Zettlemoyer
Hila Gonen
Chandan Singh
AAML
727
5
0
29 May 2025
Lifelong Safety Alignment for Language Models
Haoyu Wang
Zeyu Qin
Yifei Zhao
C. Du
Min Lin
Xueqian Wang
Tianyu Pang
KELM, CLL
395
7
0
26 May 2025
Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang
Mingyang Wang
Yihong Liu
Hinrich Schütze
Barbara Plank
565
7
0
22 May 2025
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
Kaiwen Zhou
Xuandong Zhao
Gaowen Liu
Jayanth Srinivasa
Aosong Feng
Dawn Song
Xin Eric Wang
LRM, LLMSV
397
18
0
22 May 2025
Safety Alignment Can Be Not Superficial With Explicit Safety Signals
Jianwei Li
Jung-Eng Kim
AAML
510
7
0
19 May 2025
Safe Vision-Language Models via Unsafe Weights Manipulation
Moreno D'Incà
E. Peruzzo
Xingqian Xu
Humphrey Shi
Andrii Zadaianchuk
Goran Frehse
MU
603
1
0
14 Mar 2025
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Hanjiang Hu
Alexander Robey
Changliu Liu
AAML, LLMSV
554
15
0
28 Feb 2025
Practical Principles for AI Cost and Compute Accounting
Stephen Casper
Luke Bailey
Tim Schreier
394
3
0
21 Feb 2025
RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars
Yuncheng Hua
Zhuang Li
Hao Xue
Flora D. Salim
Gholamreza Haffari
ALM
622
2
0
17 Feb 2025
Trustworthy AI: Safety, Bias, and Privacy -- A Survey
Xingli Fang
Jianwei Li
Varun Mulchandani
Jung-Eun Kim
453
0
0
11 Feb 2025
Safety Reasoning with Guidelines
Haoyu Wang
Zeyu Qin
Li Shen
Xueqian Wang
Minhao Cheng
Dacheng Tao
537
4
0
06 Feb 2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che
Stephen Casper
Robert Kirk
Anirudh Satheesh
Stewart Slocum
...
Zikui Cai
Bilal Chughtai
Y. Gal
Furong Huang
Dylan Hadfield-Menell
MU, AAML, ELM
743
34
0
03 Feb 2025
HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor
Zihui Wu
Haichang Gao
Jiacheng Luo
Zhaoxiang Liu
551
2
0
23 Jan 2025
Regression for the Mean: Auto-Evaluation and Inference with Few Labels through Post-hoc Regression
Benjamin Eyre
David Madras
526
5
0
19 Nov 2024
POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization
Batuhan K. Karaman
Ishmam Zabir
Alon Benhaim
Vishrav Chaudhary
M. Sabuncu
Xia Song
AI4CE
433
5
0
16 Oct 2024
Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Sihang Zhao
Youliang Yuan
Xiaoying Tang
Pinjia He
230
5
0
15 Oct 2024
Locking Down the Finetuned LLMs Safety
Minjun Zhu
Linyi Yang
Yifan Wei
Ningyu Zhang
Yue Zhang
374
25
0
14 Oct 2024
HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
Lingrui Mei
Shenghua Liu
Yiwei Wang
Baolong Bi
Ruibin Yuan
Xueqi Cheng
290
11
0
03 Oct 2024
Endless Jailbreaks with Bijection Learning
International Conference on Learning Representations (ICLR), 2024
Brian R. Y. Huang
Maximilian Li
Leonard Tang
AAML
410
16
0
02 Oct 2024
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li
Ziwen Han
Ian Steneker
Willow Primack
Riley Goodside
Hugh Zhang
Zifan Wang
Cristina Menghini
Summer Yue
AAML, MU
409
133
0
27 Aug 2024
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
Zihui Wu
Haichang Gao
Jianping He
Ping Wang
429
19
0
25 Jul 2024
Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu
Yishuo Cai
Zhenhong Zhou
Renjie Gu
Haiqin Weng
Yan Liu
Tianwei Zhang
Wei Xu
Han Qiu
297
14
0
23 Jul 2024
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi
Alexander Wei
Eric Wallace
Tony T. Wang
Nika Haghtalab
Jacob Steinhardt
SILM, AAML
280
73
0
28 Jun 2024
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
International Conference on Learning Representations (ICLR), 2024
Xiangyu Qi
Ashwinee Panda
Kaifeng Lyu
Xiao Ma
Subhrajit Roy
Ahmad Beirami
Prateek Mittal
Peter Henderson
298
348
0
10 Jun 2024
Improving Alignment and Robustness with Circuit Breakers
Neural Information Processing Systems (NeurIPS), 2024
Andy Zou
Long Phan
Justin Wang
Derek Duenas
Maxwell Lin
Maksym Andriushchenko
Rowan Wang
Zico Kolter
Matt Fredrikson
Dan Hendrycks
AAML
728
252
0
06 Jun 2024
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Haibo Jin
Andy Zhou
Joe D. Menke
Haohan Wang
263
45
0
30 May 2024
Protecting Your LLMs with Information Bottleneck
Zichuan Liu
Zefan Wang
Linjie Xu
Jinyu Wang
Lei Song
Tianchun Wang
Chunlin Chen
Wei Cheng
Jiang Bian
KELM, AAML
276
34
0
22 Apr 2024
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus
Arman Zharmagambetov
Chuan Guo
Brandon Amos
Yuandong Tian
AAML
427
147
0
21 Apr 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace
Kai Y. Xiao
R. Leike
Lilian Weng
Johannes Heidecke
Alex Beutel
SILM
428
290
0
19 Apr 2024
Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward
Xuan Xie
Yuheng Huang
Zhehua Zhou
Da Song
Lei Ma
OffRL
434
12
0
12 Apr 2024
Detoxifying Large Language Models via Knowledge Editing
Meng Wang
Ningyu Zhang
Ziwen Xu
Zekun Xi
Shumin Deng
Yunzhi Yao
Qishen Zhang
Linyi Yang
Yongfeng Zhang
Huajun Chen
KELM
432
100
0
21 Mar 2024
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
Zhuowen Yuan
Zidi Xiong
Yi Zeng
Ning Yu
Ruoxi Jia
Basel Alomair
Yue Liu
AAML, KELM
338
73
0
19 Mar 2024
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Qibing Ren
Chang Gao
Jing Shao
Junchi Yan
Xin Tan
Wai Lam
Lizhuang Ma
ALM, ELM, AAML
520
65
0
12 Mar 2024
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
Xiaomeng Hu
Pin-Yu Chen
Tsung-Yi Ho
AAML
265
69
0
01 Mar 2024
SoFA: Shielded On-the-fly Alignment via Priority Rule Following
Xinyu Lu
Bowen Yu
Yaojie Lu
Hongyu Lin
Haiyang Yu
Le Sun
Xianpei Han
Yongbin Li
231
20
0
27 Feb 2024
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Zhangchen Xu
Fengqing Jiang
Luyao Niu
Jinyuan Jia
Bill Yuchen Lin
Radha Poovendran
AAML
637
235
0
14 Feb 2024
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika
Long Phan
Xuwang Yin
Andy Zou
Zifan Wang
...
Nathaniel Li
Steven Basart
Bo Li
David A. Forsyth
Dan Hendrycks
AAML
495
938
0
06 Feb 2024
PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Zaibin Zhang
Yongting Zhang
Lijun Li
Hongzhi Gao
Lijun Wang
Huchuan Lu
Feng Zhao
Yu Qiao
Jing Shao
LLMAG
429
80
0
22 Jan 2024
Self-Rewarding Language Models
Weizhe Yuan
Richard Yuanzhe Pang
Kyunghyun Cho
Xian Li
Sainbayar Sukhbaatar
Jing Xu
Jason Weston
ReLM, SyDa, ALM, LRM
989
540
0
18 Jan 2024
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yi Zeng
Hongpeng Lin
Jingwen Zhang
Diyi Yang
Ruoxi Jia
Weiyan Shi
451
568
0
12 Jan 2024