Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
arXiv:2407.09121 (v2, latest)

12 July 2024
Youliang Yuan
Wenxiang Jiao
Wenxuan Wang
Shu Yang
Jiahao Xu
Tian Liang
Pinjia He
Zhaopeng Tu
ArXiv (abs) · PDF · HTML · HuggingFace (6 upvotes) · GitHub (72★)

Papers citing "Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training"

50 / 84 papers shown
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
Junbo Zhang
Ran Chen
Qianli Zhou
Xinyang Deng
Wen Jiang
232
3
0
24 Nov 2025
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
Zihao Yi
Qingxuan Jiang
Ruotian Ma
Xingyu Chen
Qu Yang
...
Fanghua Ye
Ying Shen
Zhaopeng Tu
Xiaolong Li
Linus
279
4
0
07 Nov 2025
Read the Scene, Not the Script: Outcome-Aware Safety for LLMs
Rui Wu
Yihao Quan
Zeru Shi
Zhenting Wang
Yanshu Li
Ruixiang Tang
177
1
0
05 Oct 2025
How Catastrophic is Your LLM? Certifying Risk in Conversation
Chengxiao Wang
Isha Chaudhary
Qian Hu
Weitong Ruan
Rahul Gupta
Gagandeep Singh
199
1
0
04 Oct 2025
BiasGym: A Simple and Generalizable Framework for Analyzing and Removing Biases through Elicitation
Sekh Mainul Islam
Nadav Borenstein
Siddhesh Pawar
Haeun Yu
Arnav Arora
Isabelle Augenstein
289
0
0
12 Aug 2025
FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
Christina Q. Knight
Kaustubh Deshpande
Ved Sirdeshmukh
Meher Mankikar
Scale Red Team
SEAL Research Team
Julian Michael
AAML, ELM
380
8
0
17 Jun 2025
Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs
Hiroshi Matsuda
Chunpeng Ma
Masayuki Asahara
402
9
0
11 Jun 2025
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Yang Li
Qiang Sheng
Yehan Yang
Xueyao Zhang
Juan Cao
432
12
0
11 Jun 2025
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models
Youze Wang
Wenbo Hu
Yinpeng Dong
Jing Liu
Hanwang Zhang
Richang Hong
378
16
0
02 Jun 2025
A Red Teaming Roadmap Towards System-Level Safety
Zifan Wang
Christina Q. Knight
Jeremy Kritz
Willow Primack
Julian Michael
AAML
384
2
0
30 May 2025
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities
Sahil Verma
Keegan E. Hines
J. Bilmes
Charlotte Siska
Luke Zettlemoyer
Hila Gonen
Chandan Singh
AAML
727
5
0
29 May 2025
Lifelong Safety Alignment for Language Models
Haoyu Wang
Zeyu Qin
Yifei Zhao
C. Du
Min Lin
Xueqian Wang
Tianyu Pang
KELM, CLL
395
7
0
26 May 2025
Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang
Mingyang Wang
Yihong Liu
Hinrich Schütze
Barbara Plank
565
7
0
22 May 2025
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
Kaiwen Zhou
Xuandong Zhao
Gaowen Liu
Jayanth Srinivasa
Aosong Feng
Dawn Song
Xin Eric Wang
LRM, LLMSV
397
18
0
22 May 2025
Safety Alignment Can Be Not Superficial With Explicit Safety Signals
Jianwei Li
Jung-Eng Kim
AAML
510
7
0
19 May 2025
Safe Vision-Language Models via Unsafe Weights Manipulation
Moreno D'Incà
E. Peruzzo
Xingqian Xu
Humphrey Shi
Andrii Zadaianchuk
Goran Frehse
MU
603
1
0
14 Mar 2025
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Hanjiang Hu
Alexander Robey
Changliu Liu
AAML, LLMSV
554
15
0
28 Feb 2025
Practical Principles for AI Cost and Compute Accounting
Stephen Casper
Luke Bailey
Tim Schreier
394
3
0
21 Feb 2025
RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars
Yuncheng Hua
Zhuang Li
Hao Xue
Flora D. Salim
Gholamreza Haffari
ALM
622
2
0
17 Feb 2025
Trustworthy AI: Safety, Bias, and Privacy -- A Survey
Xingli Fang
Jianwei Li
Varun Mulchandani
Jung-Eun Kim
453
0
0
11 Feb 2025
Safety Reasoning with Guidelines
Haoyu Wang
Zeyu Qin
Li Shen
Xueqian Wang
Minhao Cheng
Dacheng Tao
537
4
0
06 Feb 2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che
Stephen Casper
Robert Kirk
Anirudh Satheesh
Stewart Slocum
...
Zikui Cai
Bilal Chughtai
Y. Gal
Furong Huang
Dylan Hadfield-Menell
MU, AAML, ELM
743
34
0
03 Feb 2025
HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor
Zihui Wu
Haichang Gao
Jiacheng Luo
Zhaoxiang Liu
551
2
0
23 Jan 2025
Regression for the Mean: Auto-Evaluation and Inference with Few Labels through Post-hoc Regression
Benjamin Eyre
David Madras
526
5
0
19 Nov 2024
POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization
Batuhan K. Karaman
Ishmam Zabir
Alon Benhaim
Vishrav Chaudhary
M. Sabuncu
Xia Song
AI4CE
433
5
0
16 Oct 2024
Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Sihang Zhao
Youliang Yuan
Xiaoying Tang
Pinjia He
230
5
0
15 Oct 2024
Locking Down the Finetuned LLMs Safety
Minjun Zhu
Linyi Yang
Yifan Wei
Ningyu Zhang
Yue Zhang
374
25
0
14 Oct 2024
HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
Lingrui Mei
Shenghua Liu
Yiwei Wang
Baolong Bi
Ruibin Yuan
Xueqi Cheng
290
11
0
03 Oct 2024
Endless Jailbreaks with Bijection Learning
International Conference on Learning Representations (ICLR), 2024
Brian R. Y. Huang
Maximilian Li
Leonard Tang
AAML
410
16
0
02 Oct 2024
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li
Ziwen Han
Ian Steneker
Willow Primack
Riley Goodside
Hugh Zhang
Zifan Wang
Cristina Menghini
Summer Yue
AAML, MU
409
133
0
27 Aug 2024
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
Zihui Wu
Haichang Gao
Jianping He
Ping Wang
429
19
0
25 Jul 2024
Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu
Yishuo Cai
Zhenhong Zhou
Renjie Gu
Haiqin Weng
Yan Liu
Tianwei Zhang
Wei Xu
Han Qiu
297
14
0
23 Jul 2024
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi
Alexander Wei
Eric Wallace
Tony T. Wang
Nika Haghtalab
Jacob Steinhardt
SILM, AAML
280
73
0
28 Jun 2024
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
International Conference on Learning Representations (ICLR), 2024
Xiangyu Qi
Ashwinee Panda
Kaifeng Lyu
Xiao Ma
Subhrajit Roy
Ahmad Beirami
Prateek Mittal
Peter Henderson
298
348
0
10 Jun 2024
Improving Alignment and Robustness with Circuit Breakers
Neural Information Processing Systems (NeurIPS), 2024
Andy Zou
Long Phan
Justin Wang
Derek Duenas
Maxwell Lin
Maksym Andriushchenko
Rowan Wang
Zico Kolter
Matt Fredrikson
Dan Hendrycks
AAML
728
252
0
06 Jun 2024
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Haibo Jin
Andy Zhou
Joe D. Menke
Haohan Wang
263
45
0
30 May 2024
Protecting Your LLMs with Information Bottleneck
Zichuan Liu
Zefan Wang
Linjie Xu
Jinyu Wang
Lei Song
Tianchun Wang
Chunlin Chen
Wei Cheng
Jiang Bian
KELM, AAML
276
34
0
22 Apr 2024
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus
Arman Zharmagambetov
Chuan Guo
Brandon Amos
Yuandong Tian
AAML
427
147
0
21 Apr 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace
Kai Y. Xiao
R. Leike
Lilian Weng
Johannes Heidecke
Alex Beutel
SILM
428
290
0
19 Apr 2024
Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward
Xuan Xie
Yuheng Huang
Zhehua Zhou
Da Song
Lei Ma
OffRL
434
12
0
12 Apr 2024
Detoxifying Large Language Models via Knowledge Editing
Meng Wang
Ningyu Zhang
Ziwen Xu
Zekun Xi
Shumin Deng
Yunzhi Yao
Qishen Zhang
Linyi Yang
Yongfeng Zhang
Huajun Chen
KELM
432
100
0
21 Mar 2024
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
Zhuowen Yuan
Zidi Xiong
Yi Zeng
Ning Yu
Ruoxi Jia
Basel Alomair
Yue Liu
AAML, KELM
338
73
0
19 Mar 2024
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Qibing Ren
Chang Gao
Jing Shao
Junchi Yan
Xin Tan
Wai Lam
Lizhuang Ma
ALM, ELM, AAML
520
65
0
12 Mar 2024
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
Xiaomeng Hu
Pin-Yu Chen
Tsung-Yi Ho
AAML
265
69
0
01 Mar 2024
SoFA: Shielded On-the-fly Alignment via Priority Rule Following
Xinyu Lu
Bowen Yu
Yaojie Lu
Hongyu Lin
Haiyang Yu
Le Sun
Xianpei Han
Yongbin Li
231
20
0
27 Feb 2024
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Zhangchen Xu
Fengqing Jiang
Luyao Niu
Jinyuan Jia
Bill Yuchen Lin
Radha Poovendran
AAML
637
235
0
14 Feb 2024
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika
Long Phan
Xuwang Yin
Andy Zou
Zifan Wang
...
Nathaniel Li
Steven Basart
Bo Li
David A. Forsyth
Dan Hendrycks
AAML
495
938
0
06 Feb 2024
PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Zaibin Zhang
Yongting Zhang
Lijun Li
Hongzhi Gao
Lijun Wang
Huchuan Lu
Feng Zhao
Yu Qiao
Jing Shao
LLMAG
429
80
0
22 Jan 2024
Self-Rewarding Language Models
Weizhe Yuan
Richard Yuanzhe Pang
Kyunghyun Cho
Xian Li
Sainbayar Sukhbaatar
Jing Xu
Jason Weston
ReLM, SyDa, ALM, LRM
989
540
0
18 Jan 2024
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yi Zeng
Hongpeng Lin
Jingwen Zhang
Diyi Yang
Ruoxi Jia
Weiyan Shi
451
568
0
12 Jan 2024