HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
v2 (latest)

6 February 2024
Mantas Mazeika
Long Phan
Xuwang Yin
Andy Zou
Zifan Wang
Norman Mu
Elham Sakhaee
Nathaniel Li
Steven Basart
Bo Li
David A. Forsyth
Dan Hendrycks
    AAML
ArXiv (abs) · PDF · HTML · HuggingFace (6 upvotes) · GitHub (652★)

Papers citing "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"

Showing 50 of 487 citing papers (page 4 of 10)
Many-Turn Jailbreaking
Xianjun Yang
Liqiang Xiao
Shiyang Li
Faisal Ladhak
Hyokun Yun
Linda R. Petzold
Yi Xu
William Wang
151
0
0
09 Aug 2025
SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation
Lai Jiang
Yuekang Li
Xiaohan Zhang
Youtao Ding
Li Pan
117
0
0
08 Aug 2025
Multi-Level Safety Continual Projection for Fine-Tuned Large Language Models without Retraining
Bing Han
Feifei Zhao
Dongcheng Zhao
Guobin Shen
Ping Wu
Yu Shi
Yi Zeng
217
1
0
08 Aug 2025
Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
Matteo Prandi
Vincenzo Suriani
Federico Pierucci
Marcello Galisai
Daniele Nardi
Piercosma Bisconti
ELM
114
0
0
07 Aug 2025
Building Effective Safety Guardrails in AI Education Tools
Hannah-Beth Clark
Laura Benton
Emma Searle
Margaux Dowland
Matthew Gregory
Will Gayne
John Roberts
16
2
0
07 Aug 2025
Automatic LLM Red Teaming
Roman Belaire
Arunesh Sinha
Pradeep Varakantham
LLMAG
193
0
0
06 Aug 2025
Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety
Zhenyu Pan
Xicheng Zhang
Y. Zhang
Jianshu Zhang
Haozheng Luo
...
Dennis Wu
Hong-Yu Chen
Philip S. Yu
Manling Li
Han Liu
AAML
222
3
0
05 Aug 2025
RegMean++: Enhancing Effectiveness and Generalization of Regression Mean for Model Merging
The-Hai Nguyen
Dang Huu-Tien
Takeshi Suzuki
Le-Minh Nguyen
MoMe
291
2
0
05 Aug 2025
Activation-Guided Local Editing for Jailbreaking Attacks
Jiecong Wang
Haoran Li
Hao Peng
Ziqian Zeng
Zihao Wang
Haohua Du
Zhengtao Yu
AAML
224
0
0
01 Aug 2025
Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report
Sajana Weerawardhena
Paul Kassianik
Blaine Nelson
Baturay Saglam
Anu Vellore
...
Dhruv Kedia
Kojin Oshiba
Zhouran Yang
Yaron Singer
Amin Karbasi
ALM, ELM
190
6
0
01 Aug 2025
Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
Ziqian Zhong
Aditi Raghunathan
208
3
0
31 Jul 2025
Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
Xikang Yang
Biyu Zhou
Xuehai Tang
Jizhong Han
Songlin Hu
AAML
168
0
0
30 Jul 2025
UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases
Raj Vardhan Tomar
Preslav Nakov
Yuxia Wang
LRM
260
3
0
29 Jul 2025
Libra: Large Chinese-based Safeguard for AI Content
Ziyang Chen
Huimu Yu
Xing Wu
Dongqin Liu
Songlin Hu
AILaw
146
1
0
29 Jul 2025
Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
Andy Zou
Maxwell Lin
Eliot Krzysztof Jones
Micha Nowak
Mateusz Dziemian
...
Nate Burnikell
Yarin Gal
Dan Hendrycks
J. Zico Kolter
Matt Fredrikson
LLMAG, AAML, ELM
159
7
0
28 Jul 2025
The Blessing and Curse of Dimensionality in Safety Alignment
R. Teo
Laziz U. Abdullaev
Tan M. Nguyen
241
5
0
27 Jul 2025
PrompTrend: Continuous Community-Driven Vulnerability Discovery and Assessment for Large Language Models
Tarek Gasmi
Ramzi Guesmi
Mootez Aloui
Jihene Bennaceur
205
0
0
25 Jul 2025
MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
Muntasir Wahed
Xiaona Zhou
Kiet A. Nguyen
Tianjiao Yu
Nirav Diwan
Gang Wang
Dilek Hakkani-Tür
Ismini Lourentzou
AAML
171
1
0
25 Jul 2025
PurpCode: Reasoning for Safer Code Generation
Jiawei Liu
Nirav Diwan
Zhe Wang
Haoyu Zhai
Xiaona Zhou
...
Hadjer Benkraouda
Yuxiang Wei
Lingming Zhang
Ismini Lourentzou
Gang Wang
SILM, LRM, ELM
448
7
0
25 Jul 2025
Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
Hao Li
Lijun Li
Zhenghao Lu
Xianyi Wei
Rui Li
Jing Shao
Lei Sha
394
11
0
24 Jul 2025
Towards Unifying Quantitative Security Benchmarking for Multi Agent Systems
Gauri Sharma
Vidhi Kulkarni
Miles King
Ken Huang
144
0
0
23 Jul 2025
The Geometry of Harmfulness in LLMs through Subconcept Probing
McNair Shah
Saleena Angeline
Adhitya Rajendra Kumar
Naitik Chheda
Kevin Zhu
Sean O'Brien
Will Cai
LLMSV
239
3
0
23 Jul 2025
Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
Yik Siu Chan
Zheng-Xin Yong
Stephen H. Bach
LRM
254
9
0
16 Jul 2025
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning
Zhengyue Zhao
Yingzi Ma
S. Jha
Marco Pavone
P. McDaniel
Chaowei Xiao
LRM
207
2
0
14 Jul 2025
Large Language Models Encode Semantics and Alignment in Linearly Separable Representations
Baturay Saglam
Paul Kassianik
Blaine Nelson
Sajana Weerawardhena
Yaron Singer
Amin Karbasi
175
3
0
13 Jul 2025
Attention-Aware GNN-based Input Defense against Multi-Turn LLM Jailbreak
Zixuan Huang
Kecheng Huang
Lihao Yin
Bowei He
Huiling Zhen
Mingxuan Yuan
Zili Shao
AAML
381
0
0
09 Jul 2025
PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
Krishna Kanth Nakka
Xue Jiang
Dmitrii Usynin
Xuebing Zhou
LLMSV
250
1
0
03 Jul 2025
Reasoning as an Adaptive Defense for Safety
Taeyoun Kim
Fahim Tajwar
Aditi Raghunathan
Aviral Kumar
LRM
176
9
0
01 Jul 2025
VERA: Variational Inference Framework for Jailbreaking Large Language Models
Anamika Lochab
Lu Yan
Patrick Pynadath
Xiangyu Zhang
Ruqi Zhang
AAML, VLM
377
1
0
27 Jun 2025
RedCoder: Automated Multi-Turn Red Teaming for Code LLMs
Wenjie Mo
Qin Liu
Xiaofei Wen
Dongwon Jung
Hadi Askari
Wenxuan Zhou
Zhe Zhao
Muhao Chen
LLMAG, AAML
177
3
1
25 Jun 2025
A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures
Dezhang Kong
Shi Lin
Zhenhua Xu
Z. J. Wang
Minghao Li
...
Ningyu Zhang
Chaochao Chen
Chunming Wu
Muhammad Khurram Khan
Meng Han
LLMAG
359
28
0
24 Jun 2025
GRAF: Multi-turn Jailbreaking via Global Refinement and Active Fabrication
Hua Tang
Lingyong Yan
Yukun Zhao
Shuaiqiang Wang
J. Huang
Dawei Yin
186
1
0
22 Jun 2025
From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers
Jingtong Su
Julia Kempe
Karen Ullrich
276
3
0
20 Jun 2025
Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models
Lei Jiang
Zixun Zhang
Zizhou Wang
Xiaobing Sun
Zhen Li
Liangli Zhen
Xiaohua Xu
AAML
234
2
0
20 Jun 2025
SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
ZhengLin Lai
MengYao Liao
Bingzhe Wu
Dong Xu
Zebin Zhao
Zhihang Yuan
Chao Fan
Jianqiang Li
MoE
205
3
0
20 Jun 2025
Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
Kartik Sharma
Yiqiao Jin
Vineeth Rakesh
Yingtong Dou
Menghai Pan
Mahashweta Das
Srijan Kumar
AAML
238
0
0
18 Jun 2025
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Thomas Kuntz
Agatha Duzan
Hao Zhao
Francesco Croce
Zico Kolter
Nicolas Flammarion
Maksym Andriushchenko
LLMAG, ELM
314
18
0
17 Jun 2025
Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
Youze Wang
Zijun Chen
Ruoyu Chen
Shishen Gu
Yinpeng Dong
...
Jun Zhu
Meng Wang
Richang Hong
Wenbo Hu
365
0
0
14 Jun 2025
QGuard:Question-based Zero-shot Guard for Multi-modal LLM Safety
Taegyeong Lee
Jeonghwa Yoo
Hyoungseo Cho
Soo Yong Kim
Yunho Maeng
AAML
279
2
0
14 Jun 2025
Improving Large Language Model Safety with Contrastive Representation Learning
Samuel Simko
Mrinmaya Sachan
Bernhard Schölkopf
Zhijing Jin
AAML
377
3
0
13 Jun 2025
InfoFlood: Jailbreaking Large Language Models with Information Overload
Advait Yadav
Haibo Jin
Man Luo
Jun Zhuang
Haohan Wang
AAML
216
3
0
13 Jun 2025
How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
Sohee Yang
Sang-Woo Lee
Nora Kassner
Daniela Gottesman
Sebastian Riedel
Mor Geva
LRM
383
4
0
12 Jun 2025
VerIF: Verification Engineering for Reinforcement Learning in Instruction Following
Hao Peng
Yunjia Qi
Xiaozhi Wang
Bin Xu
Lei Hou
Juanzi Li
OffRL
306
9
0
11 Jun 2025
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
Polina Kirichenko
Mark Ibrahim
Kamalika Chaudhuri
Samuel J. Bell
LRM
207
26
0
10 Jun 2025
AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin
Shuo Yang
Qihui Zhang
Yuyang Liu
Yue Huang
Xiaojun Jia
Kunpeng Ning
Jiayu Yao
Jigang Wang
Hailiang Dai
Yibing Song
275
10
0
10 Jun 2025
TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts
T. Krauß
Hamid Dashtbani
Alexandra Dmitrienko
157
6
0
09 Jun 2025
InverseScope: Scalable Activation Inversion for Interpreting Large Language Models
Yifan Luo
Zhennan Zhou
Bin Dong
177
0
0
09 Jun 2025
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models
Mickel Liu
L. Jiang
Yancheng Liang
S. Du
Yejin Choi
Tim Althoff
Natasha Jaques
AAML, LRM
319
15
0
09 Jun 2025
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
Leheng Sheng
Changshuo Shen
Weixiang Zhao
Junfeng Fang
Xiaohao Liu
Zhenkai Liang
Xiang Wang
An Zhang
Tat-Seng Chua
LLMSV
157
9
0
08 Jun 2025
Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values
Nell Watson
Ahmed Amer
Evan Harris
Preeti Ravindra
Shujun Zhang
233
1
0
08 Jun 2025