HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

6 February 2024
Mantas Mazeika
Long Phan
Xuwang Yin
Andy Zou
Zifan Wang
Norman Mu
Elham Sakhaee
Nathaniel Li
Steven Basart
Bo Li
David A. Forsyth
Dan Hendrycks
    AAML
arXiv (abs) · PDF · HTML · HuggingFace (6 upvotes) · GitHub (652★)

Papers citing "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"

50 / 487 papers shown
Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector
Youcheng Huang
Fengbin Zhu
Jingkun Tang
Pan Zhou
Wenqiang Lei
Jiancheng Lv
Tat-Seng Chua
AAML
178
5
0
30 Oct 2024
AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts
Vishal Kumar
Zeyi Liao
Jaylen Jones
Huan Sun
AAML
298
8
0
29 Oct 2024
Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Honglin Mu
Han He
Yuxin Zhou
Yunlong Feng
Yang Xu
...
Zeming Liu
Xudong Han
Qi Shi
Qingfu Zhu
Wanxiang Che
AAML
293
3
0
28 Oct 2024
Adversarial Attacks on Large Language Models Using Regularized Relaxation
Samuel Jacob Chacko
Sajib Biswas
Chashi Mahiul Islam
Fatema Tabassum Liza
Xiuwen Liu
AAML
252
10
0
24 Oct 2024
Dynamic Guided and Domain Applicable Safeguards for Enhanced Security in Large Language Models
He Cao
Weidi Luo
Zijing Liu
Yu Wang
Bing Feng
Xingtai Lv
Yuan Yao
Yu Li
AAML
233
0
0
23 Oct 2024
SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
Jing-Jing Li
Valentina Pyatkin
Max Kleiman-Weiner
Liwei Jiang
Nouha Dziri
Anne Collins
Jana Schaich Borg
Maarten Sap
Yejin Choi
Sydney Levine
405
0
0
22 Oct 2024
Bayesian scaling laws for in-context learning
Aryaman Arora
Dan Jurafsky
Christopher Potts
Noah D. Goodman
548
12
0
21 Oct 2024
Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
Xiao-Li Li
Zhuhong Li
Qiongxiu Li
Bingze Lee
Jinghao Cui
Xiaolin Hu
AAML
129
17
0
20 Oct 2024
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
International Conference on Learning Representations (ICLR), 2024
Florian E. Dorner
Vivian Y. Nastl
Moritz Hardt
ELM, ALM
417
23
0
17 Oct 2024
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Guo
Aaquib Syed
Abhay Sheshadri
Aidan Ewart
Gintare Karolina Dziugaite
KELM, MU
219
17
0
16 Oct 2024
Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
William Merrill
Noah A. Smith
Hannaneh Hajishirzi
Pang Wei Koh
Jesse Dodge
Pradeep Dasigi
KELM, MoMe, CLL
263
6
0
16 Oct 2024
Multi-round jailbreak attack on large language models
Yihua Zhou
Xiaochuan Shi
AAML
194
1
0
15 Oct 2024
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
Qizhang Li
Xiaochen Yang
W. Zuo
Yiwen Guo
AAML
358
3
0
15 Oct 2024
Cognitive Overload Attack: Prompt Injection for Long Context
Bibek Upadhayay
Vahid Behzadan
Amin Karbasi
AAML
288
13
0
15 Oct 2024
Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting
Yifan Luo
Zhennan Zhou
Meitan Wang
Bin Dong
228
2
0
14 Oct 2024
On Calibration of LLM-based Guard Models for Reliable Content Moderation
International Conference on Learning Representations (ICLR), 2024
Hongfu Liu
Hengguan Huang
Hao Wang
Xiangming Gu
Ye Wang
422
16
0
14 Oct 2024
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation
Zijun Wang
Haoqin Tu
J. Mei
Bingchen Zhao
Yanjie Wang
Cihang Xie
173
19
0
11 Oct 2024
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
Priyanshu Kumar
Elaine Lau
Saranya Vijayakumar
Tu Trinh
Scale Red Team
...
Sean Hendryx
Shuyan Zhou
Matt Fredrikson
Summer Yue
Zifan Wang
LLMAG
244
49
0
11 Oct 2024
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
608
1
0
09 Oct 2024
Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations
Tarun Raheja
Nilay Pochhi
AAML
240
9
0
09 Oct 2024
ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
International Conference on Learning Representations (ICLR), 2024
Yi Ding
Bolian Li
Ruqi Zhang
MLLM
317
42
0
09 Oct 2024
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
Simon Lermen
Mateusz Dziemian
Govind Pimpale
LLMAG
225
5
0
08 Oct 2024
SoK: Towards Security and Safety of Edge AI
Tatjana Wingarz
Anne Lauscher
Janick Edinger
Dominik Kaaser
Stefan Schulte
Mathias Fischer
276
2
0
07 Oct 2024
Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks
International Conference on Learning Representations (ICLR), 2024
Zi Wang
Divyam Anshumaan
Ashish Hooda
Yudong Chen
Somesh Jha
AAML
263
4
0
05 Oct 2024
You Know What I'm Saying: Jailbreak Attack via Implicit Reference
Tianyu Wu
Lingrui Mei
Ruibin Yuan
Lujun Li
Wei Xue
Yike Guo
230
14
0
04 Oct 2024
Aligning LLMs with Individual Preferences via Interaction
International Conference on Computational Linguistics (COLING), 2024
Shujin Wu
May Fung
Cheng Qian
Jeonghwan Kim
Dilek Z. Hakkani-Tür
Heng Ji
345
52
0
04 Oct 2024
Output Scouting: Auditing Large Language Models for Catastrophic Responses
Andrew Bell
Joao Fonseca
KELM
323
2
0
04 Oct 2024
A Probabilistic Perspective on Unlearning and Alignment for Large Language Models
International Conference on Learning Representations (ICLR), 2024
Yan Scholten
Stephan Günnemann
Leo Schwinn
MU
744
15
0
04 Oct 2024
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
International Conference on Learning Representations (ICLR), 2024
Xinpeng Wang
Chengzhi Hu
Paul Röttger
Barbara Plank
442
24
0
04 Oct 2024
HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
Lingrui Mei
Shenghua Liu
Yiwei Wang
Baolong Bi
Ruibin Yuan
Xueqi Cheng
257
10
0
03 Oct 2024
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
International Conference on Learning Representations (ICLR), 2024
Xiaogeng Liu
Peiran Li
Edward Suh
Yevgeniy Vorobeychik
Zhuoqing Mao
Somesh Jha
Patrick McDaniel
Huan Sun
Bo Li
Chaowei Xiao
519
102
0
03 Oct 2024
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
International Conference on Learning Representations (ICLR), 2024
Guobin Shen
Dongcheng Zhao
Yiting Dong
Xiang He
Yi Zeng
AAML
345
11
0
03 Oct 2024
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
Maya Pavlova
Erik Brinkman
Krithika Iyer
Vítor Albiero
Joanna Bitton
Hailey Nguyen
Haibin Zhang
Cristian Canton Ferrer
Ivan Evtimov
Aaron Grattafiori
ALM
233
29
0
02 Oct 2024
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
International Conference on Learning Representations (ICLR), 2024
Seanie Lee
Haebin Seong
Dong Bok Lee
Minki Kang
Xiaoyin Chen
Dominik Wagner
Yoshua Bengio
Juho Lee
Sung Ju Hwang
405
13
0
02 Oct 2024
Endless Jailbreaks with Bijection Learning
International Conference on Learning Representations (ICLR), 2024
Brian R. Y. Huang
Maximilian Li
Leonard Tang
AAML
382
14
0
02 Oct 2024
Robust LLM safeguarding via refusal feature adversarial training
International Conference on Learning Representations (ICLR), 2024
L. Yu
Virginie Do
Karen Hambardzumyan
Nicola Cancedda
AAML
356
40
0
30 Sep 2024
T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition
Neural Information Processing Systems (NeurIPS), 2024
Chen Yeh
You-Ming Chang
Wei-Chen Chiu
Ning Yu
189
3
0
29 Sep 2024
GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks
Rongchang Li
Minjie Chen
Wenpeng Xing
Han Chen
Wenpeng Xing
Meng Han
SILM, ELM
153
7
0
29 Sep 2024
Overriding Safety protections of Open-source Models
Sachin Kumar
120
5
0
28 Sep 2024
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Tiansheng Huang
Sihao Hu
Fatih Ilhan
Selim Furkan Tekin
Ling Liu
AAML
485
76
0
26 Sep 2024
Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Jinchuan Zhang
Yan Zhou
Yaxin Liu
Ziming Li
Songlin Hu
AAML
229
16
0
25 Sep 2024
Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI
Ambrish Rawat
Stefan Schoepf
Giulio Zizzo
Giandomenico Cornacchia
Muhammad Zaid Hameed
...
Elizabeth M. Daly
Mark Purcell
P. Sattigeri
Pin-Yu Chen
Kush R. Varshney
AAML
221
14
0
23 Sep 2024
Backtracking Improves Generation Safety
Yiming Zhang
Jianfeng Chi
Hailey Nguyen
Kartikeya Upasani
Daniel M. Bikel
Jason Weston
Eric Michael Smith
SILM
313
24
0
22 Sep 2024
Jailbreaking Large Language Models with Symbolic Mathematics
Emet Bethany
Mazal Bethany
Juan Arturo Nolazco Flores
S. Jha
Peyman Najafirad
AAML
208
10
0
17 Sep 2024
Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
BigData Congress [Services Society] (BSS), 2024
Md Zarif Hossain
Ahmed Imteaj
AAML, VLM
261
13
0
11 Sep 2024
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui
Yishi Xu
Zhewei Huang
Shuchang Zhou
Jianbin Jiao
Junge Zhang
PILM, AAML
351
9
0
05 Sep 2024
Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An
Sicheng Zhu
Ruiyi Zhang
Michael-Andrei Panaitescu-Liess
Yuancheng Xu
Furong Huang
AAML
391
29
0
01 Sep 2024
Legilimens: Practical and Unified Content Moderation for Large Language Model Services
Conference on Computer and Communications Security (CCS), 2024
Jialin Wu
Jiangyi Deng
Shengyuan Pang
Yanjiao Chen
Jiayang Xu
Xinfeng Li
Wei Dong
361
13
0
28 Aug 2024
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li
Ziwen Han
Ian Steneker
Willow Primack
Riley Goodside
Hugh Zhang
Zifan Wang
Cristina Menghini
Summer Yue
AAML, MU
291
106
0
27 Aug 2024
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Hongfu Liu
Yuxi Xie
Ye Wang
Michael Shieh
230
8
0
27 Aug 2024
Page 8 of 10