HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
arXiv:2402.04249 (v2, latest) · 6 February 2024
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, Dan Hendrycks · AAML
Links: arXiv (abs) · PDF · HTML · HuggingFace (6 upvotes) · GitHub (652★)
Papers citing "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal" (50 of 487 papers shown)
Are Your Agents Upward Deceivers? · Dadi Guo, Qingyu Liu, Dongrui Liu, Qihan Ren, Shuai Shao, ..., Z. Chen, Jialing Tao, Yaodong Yang, Jing Shao, Xia Hu · LLMAG · 04 Dec 2025
Context-Aware Hierarchical Learning: A Two-Step Paradigm towards Safer LLMs · Tengyun Ma, Jiaqi Yao, Daojing He, Shihao Peng, Yu Li, Shaohui Liu, Zhuotao Tian · 03 Dec 2025
RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories · Roy Rinberg, Usha Bhalla, Igor Shilov, Flavio du Pin Calmon, Rohit Gandikota · KELM, MU · 03 Dec 2025
Lumos: Let there be Language Model System Certification · Isha Chaudhary, Vedaant V. Jain, Avaljot Singh, Kavya Sachdeva, Sayan Ranu, Gagandeep Singh · 02 Dec 2025
CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer · Lavish Bansal, Naman Mishra · 02 Dec 2025
A Safety and Security Framework for Real-World Agentic Systems · Shaona Ghosh, Barnaby Simkin, Kyriacos Shiarlis, Soumili Nandi, Dan Zhao, ..., Nikki Pope, Roopa Prabhu, Daniel Rohrer, Michael Demoret, Bartley Richardson · 27 Nov 2025
Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks · Richard J. Young · ELM · 27 Nov 2025
Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression · Tianyu Zhang, Zihang Xi, Jingyu Hua, Sheng Zhong · 27 Nov 2025
Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs · Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury, Haotian An, Y. X. R. Wang, Rohit Thekkanal, Negin Sokhandan, Sharlina Keshava, Hannah R Marlowe · 26 Nov 2025
InvisibleBench: A Deployment Gate for Caregiving Relationship AI · Ali Madad · 25 Nov 2025
SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models · Mohammed Talha Alam, Nada Saadi, Fahad Shamshad, Nils Lukas, Karthik Nandakumar, Fahkri Karray, Samuele Poppi · EGVM · 24 Nov 2025
Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion · Yu Cui, Yifei Liu, Hang Fu, Sicheng Pan, Haibin Zhang, Cong Zuo, Licheng Wang · 24 Nov 2025
Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization · Xurui Li, Kaisong Song, Rui Zhu, Pin-Yu Chen, Haixu Tang · AAML · 24 Nov 2025
Automating Deception: Scalable Multi-Turn LLM Jailbreaks · Adarsh Kumarappan, Ananya Mujoo · AAML · 24 Nov 2025
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation · Junbo Zhang, Ran Chen, Qianli Zhou, Xinyang Deng, Wen Jiang · 24 Nov 2025
FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models · M. Fatehkia, Enes Altinisik, Husrev Taha Sencar · 24 Nov 2025
TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization · Yanting Wang, Runpeng Geng, Jinghui Chen, Minhao Cheng, Jinyuan Jia · 23 Nov 2025
Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria · Kartik Garg, Shourya Mishra, Kartikeya Sinha, Ojaswi Pratap Singh, Ayush Chopra, ..., Ammar Sheikh, Raghav Maheshwari, Aman Chadha, Vinija Jain, Amitava Das · OffRL · 22 Nov 2025
ASTRA: Agentic Steerability and Risk Assessment Framework · Itay Hazan, Yael Mathov, Guy Shtar, Ron Bitton, Itsik Mantin · 22 Nov 2025
The Impact of Off-Policy Training Data on Probe Generalisation · Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov · 21 Nov 2025
Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security · Wei Zhao, Zhe Li, Yige Li, Jun Sun · AAML · 20 Nov 2025
AutoBackdoor: Automating Backdoor Attacks via LLM Agents · Y. Li, Z. Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, Jun Sun · AAML, LLMAG, SILM · 20 Nov 2025
SafeRBench: Dissecting the Reasoning Safety of Large Language Models · Xin Gao, S. Yu, Z. Chen, Yueming Lyu, W. Yu, ..., Jiyao Liu, Jianxiong Gao, Jian Liang, Ziwei Liu, Chenyang Si · ELM, LRM · 19 Nov 2025
Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education · Xin Yi, Yue Li, Dongsheng Shi, Linlin Wang, Xiaoling Wang, Liang He · AAML · 18 Nov 2025
LLM Reinforcement in Context · Thomas Rivasseau · 16 Nov 2025
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs · Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma · AAML · 16 Nov 2025
AlignTree: Efficient Defense Against LLM Jailbreak Attacks · Gil Goren, Shahar Katz, Lior Wolf · AAML · 15 Nov 2025
Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs · Yuxuan Zhou, Yuzhao Peng, Yang Bai, Kuofeng Gao, Yihao Zhang, Yechao Zhang, Xun Chen, Tao Yu, Tao Dai, Shu-Tao Xia · AAML · 11 Nov 2025
Virtual Traffic Lights for Multi-Robot Navigation: Decentralized Planning with Centralized Conflict Resolution · Sagar Gupta, Thanh Vinh Nguyen, Thieu Long Phan, Vidul Attri, Archit Gupta, ..., Kevin Lee, S. W. Loke, Ronny Kutadinata, Benjamin Champion, Akansel Cosgun · 11 Nov 2025
Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment · Peng Zhang, Peijie Sun · 10 Nov 2025
JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework · Yuxuan Zhou, Yang Bai, Kuofeng Gao, Tao Dai, Shu-Tao Xia · 10 Nov 2025
AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research · Tim Beyer, Jonas Dornbusch, Jakob Steimle, Moritz Ladenburger, Leo Schwinn, Stephan Günnemann · AAML · 06 Nov 2025
Jailbreaking in the Haystack · Rishi Rajesh Shah, Chen Henry Wu, Shashwat Saxena, Ziqian Zhong, Alexander Robey, Aditi Raghunathan · 05 Nov 2025
Let the Bees Find the Weak Spots: A Path Planning Perspective on Multi-Turn Jailbreak Attacks against LLMs · Yize Liu, Yunyun Hou, Aina Sui · AAML · 05 Nov 2025
LiveSecBench: A Dynamic and Event-Driven Safety Benchmark for Chinese Language Model Applications · Yudong Li, Zhongliang Yang, Kejiang Chen, Wenxuan Wang, TianXin Zhang, ..., Xingchi Gu, Peiru Yang, Tianxin Zhang, Yue Gao, Yongfeng Huang · ELM · 04 Nov 2025
An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks · Xu Liu, Yan Chen, Kan Ling, Yichi Zhu, Hengrun Zhang, Guisheng Fan, Huiqun Yu · AAML · 04 Nov 2025
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models · Aashray Reddy, Andrew Zagula, Nicholas Saban · AAML, MU, SILM · 04 Nov 2025
Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges · Hamin Koo, Minseon Kim, Jaehyung Kim · 03 Nov 2025
Reimagining Safety Alignment with An Image · Yifan Xia, Guorui Chen, Wenqian Yu, Zhijiang Li, Philip Torr, Jindong Gu · 01 Nov 2025
Diffusion LLMs are Natural Adversaries for any LLM · David Lüdke, Tom Wollschlager, Paul Ungermann, Stephan Günnemann, Leo Schwinn · DiffM · 31 Oct 2025
Consistency Training Helps Stop Sycophancy and Jailbreaks · Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David Elson, Rohin Shah · 31 Oct 2025
Angular Steering: Behavior Control via Rotation in Activation Space · Hieu M. Vu, T. Nguyen · LLMSV · 30 Oct 2025
Reasoning Up the Instruction Ladder for Controllable Language Models · Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · LRM · 30 Oct 2025
Chain-of-Thought Hijacking · Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez · LRM · 30 Oct 2025
Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents · Julia Bazińska, Max Mathys, Francesco Casucci, Mateo Rojas-Carulla, Xander Davies, Alexandra Souly, Niklas Pfister · LLMAG, ELM · 26 Oct 2025
The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning · Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok Yan Lam · SILM · 24 Oct 2025
Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks · Mahavir Dabas, Tran Ngoc Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, ..., Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, R. Jia · AAML · 24 Oct 2025
Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency · Yukun Jiang, Mingjie Li, Michael Backes, Yang Zhang · 24 Oct 2025
AI PB: A Grounded Generative Agent for Personalized Investment Insights · Daewoo Park, Suho Park, Inseok Hong, Hanwool Lee, Junkyu Park, Sangjun Lee, Jeongman An, Hyunbin Loh · AIFin · 23 Oct 2025
Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation · Ming Li · KELM · 21 Oct 2025