arXiv:2402.04249 (v2, latest)
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
6 February 2024
Mantas Mazeika
Long Phan
Xuwang Yin
Andy Zou
Zifan Wang
Norman Mu
Elham Sakhaee
Nathaniel Li
Steven Basart
Bo Li
David A. Forsyth
Dan Hendrycks
AAML
ArXiv (abs) · PDF · HTML · HuggingFace (6 upvotes) · GitHub (652★)
Papers citing "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"
50 / 487 papers shown
Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
International Conference on Learning Representations (ICLR), 2024
Wenxuan Zhang
Juil Sock
Mohamed Elhoseiny
Adel Bibi
505
23
0
27 Aug 2024
Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
Kexin Chen
Yi Liu
Donghai Hong
Jiaying Chen
Wenhai Wang
172
6
0
18 Aug 2024
MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models
Fenghua Weng
Yue Xu
Chengyan Fu
Wenjie Wang
AAML
233
1
0
16 Aug 2024
Large language models can consistently generate high-quality content for election disinformation operations
PLoS ONE (PLoS ONE), 2024
Angus R. Williams
Liam Burke-Moore
Ryan Sze-Yin Chan
Florence E. Enock
Federico Nanni
Tvesha Sippy
Yi-Ling Chung
Evelina Gabasova
Kobi Hackenburg
Jonathan Bright
154
13
0
13 Aug 2024
WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Prannaya Gupta
Le Qi Yau
Hao Han Low
I-Shiang Lee
Hugo Maximus Lim
...
Jia Hng Koh
Dar Win Liew
Rishabh Bhardwaj
Rajat Bhardwaj
Soujanya Poria
ELM
LM&MA
240
14
0
07 Aug 2024
SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models
AAAI Conference on Artificial Intelligence (AAAI), 2024
Muxi Diao
Rumei Li
Shiyang Liu
Guogang Liao
Jingang Wang
Xunliang Cai
Weiran Xu
AAML
241
4
0
05 Aug 2024
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Neural Information Processing Systems (NeurIPS), 2024
Jingtong Su
Mingyu Lee
SangKeun Lee
217
24
0
02 Aug 2024
Tamper-Resistant Safeguards for Open-Weight LLMs
International Conference on Learning Representations (ICLR), 2024
Rishub Tamirisa
Bhrugu Bharathi
Long Phan
Andy Zhou
Alice Gatti
...
Andy Zou
Dawn Song
Bo Li
Dan Hendrycks
Mantas Mazeika
AAML
MU
470
108
0
01 Aug 2024
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Richard Ren
Steven Basart
Adam Khoja
Alice Gatti
Long Phan
...
Alexander Pan
Gabriel Mukobi
Ryan H. Kim
Stephen Fitz
Dan Hendrycks
ELM
272
48
0
31 Jul 2024
ShieldGemma: Generative AI Content Moderation Based on Gemma
Wenjun Zeng
Yuchi Liu
Ryan Mullins
Ludovic Peran
Joe Fernandez
...
Drew Proud
Piyush Kumar
Bhaktipriya Radharapu
Olivia Sturman
O. Wahltinez
AI4MH
357
112
0
31 Jul 2024
Defending Jailbreak Attack in VLMs via Cross-modality Information Detector
Yue Xu
Xiuyuan Qi
Zhan Qin
Wenjie Wang
AAML
255
6
0
31 Jul 2024
Know Your Limits: A Survey of Abstention in Large Language Models
Bingbing Wen
Jihan Yao
Shangbin Feng
Chenjun Xu
Yulia Tsvetkov
Bill Howe
Lucy Lu Wang
529
5
0
25 Jul 2024
Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu
Yishuo Cai
Zhenhong Zhou
Renjie Gu
Haiqin Weng
Yan Liu
Tianwei Zhang
Wei Xu
Han Qiu
203
13
0
23 Jul 2024
Revisiting the Robust Alignment of Circuit Breakers
Leo Schwinn
Simon Geisler
AAML
291
7
0
22 Jul 2024
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma
Satyapriya Krishna
Sebastian Gehrmann
Madhavan Seshadri
Anu Pradhan
Tom Ault
Leslie Barrett
David Rabinowitz
John Doucette
Nhathai Phan
443
42
0
20 Jul 2024
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko
Nicolas Flammarion
578
67
0
16 Jul 2024
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan
Wenxiang Jiao
Wenxuan Wang
Shu Yang
Jiahao Xu
Tian Liang
Pinjia He
Zhaopeng Tu
388
49
0
12 Jul 2024
T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
Yibo Miao
Yifan Zhu
Yinpeng Dong
Lijia Yu
Jun Zhu
Xiao-Shan Gao
EGVM
304
40
0
08 Jul 2024
R²-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang
Yue Liu
LRM
312
34
0
08 Jul 2024
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi
Yule Liu
Zhen Sun
Tianshuo Cong
Xinlei He
Jiaxing Song
Ke Xu
Qi Li
AAML
345
208
0
05 Jul 2024
Purple-teaming LLMs with Adversarial Defender Training
Jingyan Zhou
Kun Li
Junan Li
Jiawen Kang
Minda Hu
Xixin Wu
Helen Meng
AAML
229
1
0
01 Jul 2024
Badllama 3: removing safety finetuning from Llama 3 in minutes
Dmitrii Volkov
134
6
0
01 Jul 2024
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi
Alexander Wei
Eric Wallace
Tony T. Wang
Nika Haghtalab
Jacob Steinhardt
SILM
AAML
230
58
0
28 Jun 2024
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han
Kavel Rao
Allyson Ettinger
Liwei Jiang
Bill Yuchen Lin
Nathan Lambert
Yejin Choi
Nouha Dziri
383
234
0
26 Jun 2024
AI Risk Categorization Decoded (AIR 2024): From Government Regulations to Corporate Policies
Yi Zeng
Kevin Klyman
Andy Zhou
Yuzhou Nie
Minzhou Pan
Ruoxi Jia
Dawn Song
Abigail Z. Jacobs
Bo Li
262
44
0
25 Jun 2024
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
Yi Zeng
Weiyu Sun
Tran Ngoc Huynh
Dawn Song
Bo Li
Ruoxi Jia
AAML
LLMSV
226
46
0
24 Jun 2024
INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness
Hung Le
Yingbo Zhou
Caiming Xiong
Silvio Savarese
Doyen Sahoo
284
7
0
23 Jun 2024
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland
Alexander Lyzhov
Jacob Pfau
Salsabila Mahdi
Samuel R. Bowman
LLMSV
AAML
253
39
0
21 Jun 2024
Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Han Jiang
Xiaoyuan Yi
Zhihua Wei
Ziang Xiao
Shu Wang
Xing Xie
ELM
ALM
652
12
0
20 Jun 2024
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen
Xiaozhi Wang
Zijun Yao
Yushi Bai
Lei Hou
Juanzi Li
LLMSV
KELM
349
26
0
20 Jun 2024
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie
Xiangyu Qi
Yi Zeng
Yangsibo Huang
Udari Madhushani Sehwag
...
Bo Li
Kai Li
Danqi Chen
Peter Henderson
Prateek Mittal
ALM
ELM
431
141
0
20 Jun 2024
AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents
Neural Information Processing Systems (NeurIPS), 2024
Edoardo Debenedetti
Jie Zhang
Mislav Balunović
Luca Beurer-Kellner
Marc Fischer
Florian Tramèr
LLMAG
AAML
425
84
1
19 Jun 2024
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner
Kenneth Li
Yiming Wang
Fernanda Viégas
Martin Wattenberg
279
10
0
17 Jun 2024
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
Rima Hazra
Sayan Layek
Somnath Banerjee
Soujanya Poria
KELM
LLMSV
275
24
0
17 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
394
430
0
17 Jun 2024
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
Lingrui Mei
Shenghua Liu
Yiwei Wang
Baolong Bi
Jiayi Mao
Xueqi Cheng
AAML
209
20
0
17 Jun 2024
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
Yuping Lin
Pengfei He
Han Xu
Yue Xing
Makoto Yamada
Hui Liu
Shucheng Zhou
238
46
0
16 Jun 2024
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Zhao Xu
Fan Liu
Hao Liu
AAML
275
31
0
13 Jun 2024
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran
Jinyuan Liu
Yichen Gong
Jingyi Zheng
Xinlei He
Tianshuo Cong
Anyu Wang
ELM
483
24
0
13 Jun 2024
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
Xiaoshuai Song
Muxi Diao
Guanting Dong
Zhengyang Wang
Yujia Fu
...
Yejie Wang
Zhuoma Gongque
Jianing Yu
Qiuna Tan
Weiran Xu
ELM
402
27
0
12 Jun 2024
Annotation alignment: Comparing LLM and human annotations of conversational safety
Rajiv Movva
Pang Wei Koh
Emma Pierson
ALM
381
21
0
10 Jun 2024
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang
Daoyuan Wu
Zhenlan Ji
Zongjie Li
Pingchuan Ma
Shuai Wang
Yingjiu Li
Yang Liu
Ning Liu
Juergen Rahmel
AAML
485
35
0
08 Jun 2024
Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
Fan Liu
Zhao Xu
Hao Liu
AAML
258
27
0
07 Jun 2024
Improving Alignment and Robustness with Circuit Breakers
Neural Information Processing Systems (NeurIPS), 2024
Andy Zou
Long Phan
Justin Wang
Derek Duenas
Maxwell Lin
Maksym Andriushchenko
Rowan Wang
Zico Kolter
Matt Fredrikson
Dan Hendrycks
AAML
624
214
0
06 Jun 2024
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
Xiaosen Zheng
Tianyu Pang
Chao Du
Qian Liu
Jing Jiang
Min Lin
AAML
318
71
0
03 Jun 2024
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Yang Liu
Tianyu Pang
Chao Du
Yihao Huang
Jindong Gu
Yang Liu
Simeng Qin
Min Lin
AAML
367
75
0
31 May 2024
Preemptive Answer "Attacks" on Chain-of-Thought Reasoning
Rongwu Xu
Zehan Qi
Wei Xu
LRM
SILM
173
19
0
31 May 2024
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Haibo Jin
Andy Zhou
Joe D. Menke
Haohan Wang
224
40
0
30 May 2024
AI Risk Management Should Incorporate Both Safety and Security
Xiangyu Qi
Yangsibo Huang
Yi Zeng
Edoardo Debenedetti
Jonas Geiping
...
Chaowei Xiao
Yue Liu
Dawn Song
Peter Henderson
Prateek Mittal
AAML
280
20
0
29 May 2024
Voice Jailbreak Attacks Against GPT-4o
Xinyue Shen
Yixin Wu
Michael Backes
Yang Zhang
AuLLM
309
28
0
29 May 2024
Page 9 of 10