HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
arXiv:2402.04249 (v2, latest) · 6 February 2024
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, Dan Hendrycks · AAML
Links: arXiv (abs) · PDF · HTML · HuggingFace (6 upvotes) · GitHub (652★)
Papers citing "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal" (50 of 487 papers shown)
Are Your Agents Upward Deceivers? · Dadi Guo, Qingyu Liu, Dongrui Liu, Qihan Ren, Shuai Shao, ..., Z. Chen, Jialing Tao, Yaodong Yang, Jing Shao, Xia Hu · LLMAG · 04 Dec 2025
Context-Aware Hierarchical Learning: A Two-Step Paradigm towards Safer LLMs · Tengyun Ma, Jiaqi Yao, Daojing He, Shihao Peng, Yu Li, Shaohui Liu, Zhuotao Tian · 03 Dec 2025
RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories · Roy Rinberg, Usha Bhalla, Igor Shilov, Flavio du Pin Calmon, Rohit Gandikota · KELM, MU · 03 Dec 2025
Lumos: Let there be Language Model System Certification · Isha Chaudhary, Vedaant V. Jain, Avaljot Singh, Kavya Sachdeva, Sayan Ranu, Gagandeep Singh · 02 Dec 2025
CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer · Lavish Bansal, Naman Mishra · 02 Dec 2025
A Safety and Security Framework for Real-World Agentic Systems · Shaona Ghosh, Barnaby Simkin, Kyriacos Shiarlis, Soumili Nandi, Dan Zhao, ..., Nikki Pope, Roopa Prabhu, Daniel Rohrer, Michael Demoret, Bartley Richardson · 27 Nov 2025
Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks · Richard J. Young · ELM · 27 Nov 2025
Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression · Tianyu Zhang, Zihang Xi, Jingyu Hua, Sheng Zhong · 27 Nov 2025
Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs · Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury, Haotian An, Y. X. R. Wang, Rohit Thekkanal, Negin Sokhandan, Sharlina Keshava, Hannah R Marlowe · 26 Nov 2025
InvisibleBench: A Deployment Gate for Caregiving Relationship AI · Ali Madad · 25 Nov 2025
SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models · Mohammed Talha Alam, Nada Saadi, Fahad Shamshad, Nils Lukas, Karthik Nandakumar, Fahkri Karray, Samuele Poppi · EGVM · 24 Nov 2025
Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion · Yu Cui, Yifei Liu, Hang Fu, Sicheng Pan, Haibin Zhang, Cong Zuo, Licheng Wang · 24 Nov 2025
Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization · Xurui Li, Kaisong Song, Rui Zhu, Pin-Yu Chen, Haixu Tang · AAML · 24 Nov 2025
Automating Deception: Scalable Multi-Turn LLM Jailbreaks · Adarsh Kumarappan, Ananya Mujoo · AAML · 24 Nov 2025
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation · Junbo Zhang, Ran Chen, Qianli Zhou, Xinyang Deng, Wen Jiang · 24 Nov 2025
FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models · M. Fatehkia, Enes Altinisik, Husrev Taha Sencar · 24 Nov 2025
TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization · Yanting Wang, Runpeng Geng, Jinghui Chen, Minhao Cheng, Jinyuan Jia · 23 Nov 2025
Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria · Kartik Garg, Shourya Mishra, Kartikeya Sinha, Ojaswi Pratap Singh, Ayush Chopra, ..., Ammar Sheikh, Raghav Maheshwari, Aman Chadha, Vinija Jain, Amitava Das · OffRL · 22 Nov 2025
ASTRA: Agentic Steerability and Risk Assessment Framework · Itay Hazan, Yael Mathov, Guy Shtar, Ron Bitton, Itsik Mantin · 22 Nov 2025
The Impact of Off-Policy Training Data on Probe Generalisation · Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov · 21 Nov 2025
Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security · Wei Zhao, Zhe Li, Yige Li, Jun Sun · AAML · 20 Nov 2025
AutoBackdoor: Automating Backdoor Attacks via LLM Agents · Y. Li, Z. Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, Jun Sun · AAML, LLMAG, SILM · 20 Nov 2025
SafeRBench: Dissecting the Reasoning Safety of Large Language Models · Xin Gao, S. Yu, Z. Chen, Yueming Lyu, W. Yu, ..., Jiyao Liu, Jianxiong Gao, Jian Liang, Ziwei Liu, Chenyang Si · ELM, LRM · 19 Nov 2025
Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education · Xin Yi, Yue Li, Dongsheng Shi, Linlin Wang, Xiaoling Wang, Liang He · AAML · 18 Nov 2025
LLM Reinforcement in Context · Thomas Rivasseau · 16 Nov 2025
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs · Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma · AAML · 16 Nov 2025
AlignTree: Efficient Defense Against LLM Jailbreak Attacks · Gil Goren, Shahar Katz, Lior Wolf · AAML · 15 Nov 2025
Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs · Yuxuan Zhou, Yuzhao Peng, Yang Bai, Kuofeng Gao, Yihao Zhang, Yechao Zhang, Xun Chen, Tao Yu, Tao Dai, Shu-Tao Xia · AAML · 11 Nov 2025
Virtual Traffic Lights for Multi-Robot Navigation: Decentralized Planning with Centralized Conflict Resolution · Sagar Gupta, Thanh Vinh Nguyen, Thieu Long Phan, Vidul Attri, Archit Gupta, ..., Kevin Lee, S. W. Loke, Ronny Kutadinata, Benjamin Champion, Akansel Cosgun · 11 Nov 2025
Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment · Peng Zhang, Peijie Sun · 10 Nov 2025
JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework · Yuxuan Zhou, Yang Bai, Kuofeng Gao, Tao Dai, Shu-Tao Xia · 10 Nov 2025
AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research · Tim Beyer, Jonas Dornbusch, Jakob Steimle, Moritz Ladenburger, Leo Schwinn, Stephan Günnemann · AAML · 06 Nov 2025
Jailbreaking in the Haystack · Rishi Rajesh Shah, Chen Henry Wu, Shashwat Saxena, Ziqian Zhong, Alexander Robey, Aditi Raghunathan · 05 Nov 2025
Let the Bees Find the Weak Spots: A Path Planning Perspective on Multi-Turn Jailbreak Attacks against LLMs · Yize Liu, Yunyun Hou, Aina Sui · AAML · 05 Nov 2025
LiveSecBench: A Dynamic and Event-Driven Safety Benchmark for Chinese Language Model Applications · Yudong Li, Zhongliang Yang, Kejiang Chen, Wenxuan Wang, TianXin Zhang, ..., Xingchi Gu, Peiru Yang, Tianxin Zhang, Yue Gao, Yongfeng Huang · ELM · 04 Nov 2025
An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks · Xu Liu, Yan Chen, Kan Ling, Yichi Zhu, Hengrun Zhang, Guisheng Fan, Huiqun Yu · AAML · 04 Nov 2025
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models · Aashray Reddy, Andrew Zagula, Nicholas Saban · AAML, MU, SILM · 04 Nov 2025
Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges · Hamin Koo, Minseon Kim, Jaehyung Kim · 03 Nov 2025
Reimagining Safety Alignment with An Image · Yifan Xia, Guorui Chen, Wenqian Yu, Zhijiang Li, Philip Torr, Jindong Gu · 01 Nov 2025
Diffusion LLMs are Natural Adversaries for any LLM · David Lüdke, Tom Wollschlager, Paul Ungermann, Stephan Günnemann, Leo Schwinn · DiffM · 31 Oct 2025
Consistency Training Helps Stop Sycophancy and Jailbreaks · Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David Elson, Rohin Shah · 31 Oct 2025
Angular Steering: Behavior Control via Rotation in Activation Space · Hieu M. Vu, T. Nguyen · LLMSV · 30 Oct 2025
Reasoning Up the Instruction Ladder for Controllable Language Models · Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · LRM · 30 Oct 2025
Chain-of-Thought Hijacking · Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez · LRM · 30 Oct 2025
Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents · Julia Bazińska, Max Mathys, Francesco Casucci, Mateo Rojas-Carulla, Xander Davies, Alexandra Souly, Niklas Pfister · LLMAG, ELM · 26 Oct 2025
The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning · Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok Yan Lam · SILM · 24 Oct 2025
Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks · Mahavir Dabas, Tran Ngoc Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, ..., Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, R. Jia · AAML · 24 Oct 2025
Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency · Yukun Jiang, Mingjie Li, Michael Backes, Yang Zhang · 24 Oct 2025
AI PB: A Grounded Generative Agent for Personalized Investment Insights · Daewoo Park, Suho Park, Inseok Hong, Hanwool Lee, Junkyu Park, Sangjun Lee, Jeongman An, Hyunbin Loh · AIFin · 23 Oct 2025
Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation · Ming Li · KELM · 21 Oct 2025