
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (arXiv:2402.04249)

6 February 2024
Mantas Mazeika
Long Phan
Xuwang Yin
Andy Zou
Zifan Wang
Norman Mu
Elham Sakhaee
Nathaniel Li
Steven Basart
Bo Li
David A. Forsyth
Dan Hendrycks
    AAML
arXiv (abs) · PDF · HTML · HuggingFace (6 upvotes) · GitHub (652★)

Papers citing "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"

50 / 490 papers shown
Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models
Mahesh Kumar Nandwana
Youngwan Lim
Joseph Liu
Alex Yang
Varun Notibala
Nishchaie Khanna
KELM ELM
05 Dec 2025
Matching Ranks Over Probability Yields Truly Deep Safety Alignment
Jason Vega
Gagandeep Singh
AAML
05 Dec 2025
Are Your Agents Upward Deceivers?
Dadi Guo
Qingyu Liu
Dongrui Liu
Qihan Ren
Shuai Shao
...
Z. Chen
Jialing Tao
Yaodong Yang
Jing Shao
Xia Hu
LLMAG
04 Dec 2025
Context-Aware Hierarchical Learning: A Two-Step Paradigm towards Safer LLMs
Tengyun Ma
Jiaqi Yao
Daojing He
Shihao Peng
Yu Li
Shaohui Liu
Zhuotao Tian
03 Dec 2025
RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories
Roy Rinberg
Usha Bhalla
Igor Shilov
Flavio du Pin Calmon
Rohit Gandikota
KELM MU
03 Dec 2025
CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer
Lavish Bansal
Naman Mishra
02 Dec 2025
Lumos: Let there be Language Model System Certification
Isha Chaudhary
Vedaant V. Jain
Avaljot Singh
Kavya Sachdeva
Sayan Ranu
Gagandeep Singh
02 Dec 2025
Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression
Tianyu Zhang
Zihang Xi
Jingyu Hua
Sheng Zhong
27 Nov 2025
A Safety and Security Framework for Real-World Agentic Systems
Shaona Ghosh
Barnaby Simkin
Kyriacos Shiarlis
Soumili Nandi
Dan Zhao
...
Nikki Pope
Roopa Prabhu
Daniel Rohrer
Michael Demoret
Bartley Richardson
27 Nov 2025
Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks
Richard J. Young
ELM
27 Nov 2025
Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs
Dongkyu Derek Cho
Huan Song
Arijit Ghosh Chowdhury
Haotian An
Y. X. R. Wang
Rohit Thekkanal
Negin Sokhandan
Sharlina Keshava
Hannah R Marlowe
26 Nov 2025
InvisibleBench: A Deployment Gate for Caregiving Relationship AI
Ali Madad
25 Nov 2025
Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion
Yu Cui
Yifei Liu
Hang Fu
Sicheng Pan
Haibin Zhang
Cong Zuo
Licheng Wang
24 Nov 2025
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
Junbo Zhang
Ran Chen
Qianli Zhou
Xinyang Deng
Wen Jiang
24 Nov 2025
Automating Deception: Scalable Multi-Turn LLM Jailbreaks
Adarsh Kumarappan
Ananya Mujoo
AAML
24 Nov 2025
SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models
Mohammed Talha Alam
Nada Saadi
Fahad Shamshad
Nils Lukas
Karthik Nandakumar
Fahkri Karray
Samuele Poppi
EGVM
24 Nov 2025
Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization
Xurui Li
Kaisong Song
Rui Zhu
Pin-Yu Chen
Haixu Tang
AAML
24 Nov 2025
FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models
M. Fatehkia
Enes Altinisik
Husrev Taha Sencar
24 Nov 2025
TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization
Yanting Wang
Runpeng Geng
Jinghui Chen
Minhao Cheng
Jinyuan Jia
23 Nov 2025
ASTRA: Agentic Steerability and Risk Assessment Framework
Itay Hazan
Yael Mathov
Guy Shtar
Ron Bitton
Itsik Mantin
22 Nov 2025
Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria
Kartik Garg
Shourya Mishra
Kartikeya Sinha
Ojaswi Pratap Singh
Ayush Chopra
...
Ammar Sheikh
Raghav Maheshwari
Aman Chadha
Vinija Jain
Amitava Das
OffRL
22 Nov 2025
The Impact of Off-Policy Training Data on Probe Generalisation
Nathalie Kirch
Samuel Dower
Adrians Skapars
Ekdeep Singh Lubana
Dmitrii Krasheninnikov
21 Nov 2025
AutoBackdoor: Automating Backdoor Attacks via LLM Agents
Y. Li
Z. Li
Wei Zhao
Nay Myat Min
Hanxun Huang
Xingjun Ma
Jun Sun
AAML LLMAG SILM
20 Nov 2025
Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security
Wei Zhao
Zhe Li
Yige Li
Jun Sun
AAML
20 Nov 2025
SafeRBench: Dissecting the Reasoning Safety of Large Language Models
Xin Gao
S. Yu
Z. Chen
Yueming Lyu
W. Yu
...
Jiyao Liu
Jianxiong Gao
Jian Liang
Ziwei Liu
Chenyang Si
ELM LRM
19 Nov 2025
Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education
Xin Yi
Yue Li
Dongsheng Shi
Linlin Wang
Xiaoling Wang
Liang He
AAML
18 Nov 2025
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
Yunhao Chen
Xin Wang
Juncheng Li
Yixu Wang
Jie Li
Yan Teng
Yingchun Wang
Xingjun Ma
AAML
16 Nov 2025
LLM Reinforcement in Context
Thomas Rivasseau
16 Nov 2025
AlignTree: Efficient Defense Against LLM Jailbreak Attacks
Gil Goren
Shahar Katz
Lior Wolf
AAML
15 Nov 2025
Virtual Traffic Lights for Multi-Robot Navigation: Decentralized Planning with Centralized Conflict Resolution
Sagar Gupta
Thanh Vinh Nguyen
Thieu Long Phan
Vidul Attri
Archit Gupta
...
Kevin Lee
S. W. Loke
Ronny Kutadinata
Benjamin Champion
Akansel Cosgun
11 Nov 2025
Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs
Yuxuan Zhou
Yuzhao Peng
Yang Bai
Kuofeng Gao
Yihao Zhang
Yechao Zhang
Xun Chen
Tao Yu
Tao Dai
Shu-Tao Xia
AAML
11 Nov 2025
JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework
Yuxuan Zhou
Yang Bai
Kuofeng Gao
Tao Dai
Shu-Tao Xia
10 Nov 2025
Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment
Peng Zhang
Peijie Sun
10 Nov 2025
AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research
Tim Beyer
Jonas Dornbusch
Jakob Steimle
Moritz Ladenburger
Leo Schwinn
Stephan Günnemann
AAML
06 Nov 2025
Jailbreaking in the Haystack
Rishi Rajesh Shah
Chen Henry Wu
Shashwat Saxena
Ziqian Zhong
Alexander Robey
Aditi Raghunathan
05 Nov 2025
Let the Bees Find the Weak Spots: A Path Planning Perspective on Multi-Turn Jailbreak Attacks against LLMs
Yize Liu
Yunyun Hou
Aina Sui
AAML
05 Nov 2025
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Aashray Reddy
Andrew Zagula
Nicholas Saban
AAML MU SILM
04 Nov 2025
An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks
Xu Liu
Yan Chen
Kan Ling
Yichi Zhu
Hengrun Zhang
Guisheng Fan
Huiqun Yu
AAML
04 Nov 2025
LiveSecBench: A Dynamic and Event-Driven Safety Benchmark for Chinese Language Model Applications
Yudong Li
Zhongliang Yang
Kejiang Chen
Wenxuan Wang
TianXin Zhang
...
Xingchi Gu
Peiru Yang
Tianxin Zhang
Yue Gao
Yongfeng Huang
ELM
04 Nov 2025
Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
Hamin Koo
Minseon Kim
Jaehyung Kim
03 Nov 2025
Reimagining Safety Alignment with An Image
Yifan Xia
Guorui Chen
Wenqian Yu
Zhijiang Li
Philip Torr
Jindong Gu
01 Nov 2025
Consistency Training Helps Stop Sycophancy and Jailbreaks
Alex Irpan
Alexander Matt Turner
Mark Kurzeja
David Elson
Rohin Shah
31 Oct 2025
Diffusion LLMs are Natural Adversaries for any LLM
David Lüdke
Tom Wollschlager
Paul Ungermann
Stephan Günnemann
Leo Schwinn
DiffM
31 Oct 2025
Angular Steering: Behavior Control via Rotation in Activation Space
Hieu M. Vu
T. Nguyen
LLMSV
30 Oct 2025
Chain-of-Thought Hijacking
Jianli Zhao
Tingchen Fu
Rylan Schaeffer
Mrinank Sharma
Fazl Barez
LRM
30 Oct 2025
Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng
Vidhisha Balachandran
Chan Young Park
Faeze Brahman
Sachin Kumar
LRM
30 Oct 2025
Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents
Julia Bazińska
Max Mathys
Francesco Casucci
Mateo Rojas-Carulla
Xander Davies
Alexandra Souly
Niklas Pfister
LLMAG ELM
26 Oct 2025
The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning
Mingrui Liu
Sixiao Zhang
Cheng Long
Kwok Yan Lam
SILM
24 Oct 2025
Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency
Yukun Jiang
Mingjie Li
Michael Backes
Yang Zhang
24 Oct 2025
Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
Mahavir Dabas
Tran Ngoc Huynh
Nikhil Reddy Billa
Jiachen T. Wang
Peng Gao
...
Yao Ma
Rahul Gupta
Ming Jin
Prateek Mittal
R. Jia
AAML
24 Oct 2025
Page 1 of 10