HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

6 February 2024
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, Dan Hendrycks
AAML
ArXiv (abs) · PDF · HTML · HuggingFace (6 upvotes) · GitHub (652★)

Papers citing "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"

Showing 37 of 487 citing papers (page 10 of 10).
Improved Generation of Adversarial Examples Against Safety-aligned LLMs
Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen
AAML, SILM · 28 May 2024
Learning diverse attacks on large language models for robust red-teaming and safety tuning
Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, ..., Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain
AAML · 28 May 2024
Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character
Siyuan Ma, Weidi Luo, Yu Wang, Xiaogeng Liu
25 May 2024
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, Wenjie Li
AAML · 25 May 2024
Efficient Adversarial Training in LLMs with Continuous Attacks
Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn
AAML · 24 May 2024
ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation. Neural Information Processing Systems (NeurIPS), 2024.
Jingnan Zheng, Han Wang, An Zhang, Tai D. Nguyen, Jun Sun, Tat-Seng Chua
LLMAG · 23 May 2024
Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias. Neural Information Processing Systems (NeurIPS), 2024.
Shan Chen, Jack Gallifant, Mingye Gao, Pedro Moreira, Nikolaj Munch, ..., Hugo J. W. L. Aerts, Brian Anthony, Leo Anthony Celi, William G. La Cava, Danielle S. Bitterman
09 May 2024
Don't Say No: Jailbreaking LLM by Suppressing Refusal
Yukai Zhou, Jian Lou, Zhijie Huang, Zhan Qin, Yibei Yang, Wenjie Wang
AAML · 25 Apr 2024
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian
AAML · 21 Apr 2024
JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Haoyu Tian, Wei Zhang, Minfeng Zhu, Wei Chen
12 Apr 2024
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
Zeyi Liao, Huan Sun
AAML · 11 Apr 2024
Rethinking How to Evaluate Language Model Jailbreak
Hongyu Cai, Arjun Arunasalam, Leo Y. Lin, Antonio Bianchi, Z. Berkay Celik
ALM · 09 Apr 2024
SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy
ELM, KELM · 08 Apr 2024
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
M. Russinovich, Ahmed Salem, Ronen Eldan
02 Apr 2024
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. International Conference on Learning Representations (ICLR), 2024.
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
AAML · 02 Apr 2024
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Patrick Chao, Edoardo Debenedetti, Avi Schwarzschild, Maksym Andriushchenko, Francesco Croce, ..., Nicolas Flammarion, George J. Pappas, F. Tramèr, Hamed Hassani, Eric Wong
ALM, ELM, AAML · 28 Mar 2024
Testing the Limits of Jailbreaking Defenses with the Purple Problem
Taeyoun Kim, Suhas Kotha, Aditi Raghunathan
AAML · 20 Mar 2024
Mitigating Dialogue Hallucination for Large Vision Language Models via Adversarial Instruction Tuning
Dongmin Park, Zhaofang Qian, Guangxing Han, Ser-Nam Lim
MLLM · 15 Mar 2024
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, ..., Yan Shoshitaishvili, Jimmy Ba, K. Esvelt, Alexandr Wang, Dan Hendrycks
ELM · 05 Mar 2024
Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction
Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, Kai Chen
AAML · 28 Feb 2024
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, ..., Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, Lichao Sun
VLM, VGen, EGVM · 27 Feb 2024
Immunization against harmful fine-tuning attacks
Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz
AAML · 26 Feb 2024
LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
Daoyuan Wu, Shuaibao Wang, Yang Liu, Ning Liu
AAML · 24 Feb 2024
A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models
Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, S. Picek
PILM, AAML · 21 Feb 2024
A StrongREJECT for Empty Jailbreaks
Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, ..., Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer
15 Feb 2024
Instruction Backdoor Attacks Against Customized LLMs
Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, Yang Zhang
AAML, SILM · 14 Feb 2024
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Günnemann
AAML · 14 Feb 2024
Attacking Large Language Models with Projected Gradient Descent
Simon Geisler, Tom Wollschlager, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann
AAML, SILM · 14 Feb 2024
COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
Xing-ming Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu
AAML · 13 Feb 2024
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao
ELM · 07 Feb 2024
GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang
LLMAG · 05 Feb 2024
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Andy Zhou, Bo Li, Haohan Wang
AAML · 30 Jan 2024
SLANG: New Concept Comprehension of Large Language Models. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Xueqi Chen
KELM · 23 Jan 2024
Can LLMs Follow Simple Rules?
Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David Wagner
ALM · 06 Nov 2023
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang
10 Oct 2023
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. International Conference on Learning Representations (ICLR), 2023.
Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao
SILM · 03 Oct 2023
Baichuan 2: Open Large-scale Language Models
Ai Ming Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, ..., Youxin Jiang, Yuchen Gao, Yupeng Zhang, Guosheng Dong, Zhiying Wu
ELM, LRM · 19 Sep 2023