HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

6 February 2024
Mantas Mazeika
Long Phan
Xuwang Yin
Andy Zou
Zifan Wang
Norman Mu
Elham Sakhaee
Nathaniel Li
Steven Basart
Bo Li
David A. Forsyth
Dan Hendrycks
    AAML
arXiv (abs) · PDF · HTML · HuggingFace (6 upvotes) · GitHub (652★)

Papers citing "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"

50 / 487 papers shown
Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector
Youcheng Huang
Fengbin Zhu
Jingkun Tang
Pan Zhou
Wenqiang Lei
Jiancheng Lv
Tat-Seng Chua
AAML
178
5
0
30 Oct 2024
AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts
Vishal Kumar
Zeyi Liao
Jaylen Jones
Huan Sun
AAML
298
8
0
29 Oct 2024
Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Honglin Mu
Han He
Yuxin Zhou
Yunlong Feng
Yang Xu
...
Zeming Liu
Xudong Han
Qi Shi
Qingfu Zhu
Wanxiang Che
AAML
293
3
0
28 Oct 2024
Adversarial Attacks on Large Language Models Using Regularized Relaxation
Samuel Jacob Chacko
Sajib Biswas
Chashi Mahiul Islam
Fatema Tabassum Liza
Xiuwen Liu
AAML
252
10
0
24 Oct 2024
Dynamic Guided and Domain Applicable Safeguards for Enhanced Security in Large Language Models
He Cao
Weidi Luo
Zijing Liu
Yu Wang
Bing Feng
Xingtai Lv
Yuan Yao
Yu Li
AAML
233
0
0
23 Oct 2024
SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
Jing-Jing Li
Valentina Pyatkin
Max Kleiman-Weiner
Liwei Jiang
Nouha Dziri
Anne Collins
Jana Schaich Borg
Maarten Sap
Yejin Choi
Sydney Levine
405
0
0
22 Oct 2024
Bayesian scaling laws for in-context learning
Aryaman Arora
Dan Jurafsky
Christopher Potts
Noah D. Goodman
548
12
0
21 Oct 2024
Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
Xiao-Li Li
Zhuhong Li
Qiongxiu Li
Bingze Lee
Jinghao Cui
Xiaolin Hu
AAML
129
17
0
20 Oct 2024
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
International Conference on Learning Representations (ICLR), 2024
Florian E. Dorner
Vivian Y. Nastl
Moritz Hardt
ELM, ALM
417
23
0
17 Oct 2024
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Guo
Aaquib Syed
Abhay Sheshadri
Aidan Ewart
Gintare Karolina Dziugaite
KELM, MU
219
17
0
16 Oct 2024
Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
William Merrill
Noah A. Smith
Hannaneh Hajishirzi
Pang Wei Koh
Jesse Dodge
Pradeep Dasigi
KELM, MoMe, CLL
263
6
0
16 Oct 2024
Multi-round jailbreak attack on large language models
Yihua Zhou
Xiaochuan Shi
AAML
194
1
0
15 Oct 2024
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
Qizhang Li
Xiaochen Yang
W. Zuo
Yiwen Guo
AAML
358
3
0
15 Oct 2024
Cognitive Overload Attack: Prompt Injection for Long Context
Bibek Upadhayay
Vahid Behzadan
Amin Karbasi
AAML
288
13
0
15 Oct 2024
Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting
Yifan Luo
Zhennan Zhou
Meitan Wang
Bin Dong
228
2
0
14 Oct 2024
On Calibration of LLM-based Guard Models for Reliable Content Moderation
International Conference on Learning Representations (ICLR), 2024
Hongfu Liu
Hengguan Huang
Hao Wang
Xiangming Gu
Ye Wang
422
16
0
14 Oct 2024
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation
Zijun Wang
Haoqin Tu
J. Mei
Bingchen Zhao
Yanjie Wang
Cihang Xie
173
19
0
11 Oct 2024
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
Priyanshu Kumar
Elaine Lau
Saranya Vijayakumar
Tu Trinh
Scale Red Team
...
Sean Hendryx
Shuyan Zhou
Matt Fredrikson
Summer Yue
Zifan Wang
LLMAG
244
49
0
11 Oct 2024
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
608
1
0
09 Oct 2024
Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations
Tarun Raheja
Nilay Pochhi
AAML
240
9
0
09 Oct 2024
ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
International Conference on Learning Representations (ICLR), 2024
Yi Ding
Bolian Li
Ruqi Zhang
MLLM
317
42
0
09 Oct 2024
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
Simon Lermen
Mateusz Dziemian
Govind Pimpale
LLMAG
225
5
0
08 Oct 2024
SoK: Towards Security and Safety of Edge AI
Tatjana Wingarz
Anne Lauscher
Janick Edinger
Dominik Kaaser
Stefan Schulte
Mathias Fischer
276
2
0
07 Oct 2024
Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks
International Conference on Learning Representations (ICLR), 2024
Zi Wang
Divyam Anshumaan
Ashish Hooda
Yudong Chen
Somesh Jha
AAML
263
4
0
05 Oct 2024
You Know What I'm Saying: Jailbreak Attack via Implicit Reference
Tianyu Wu
Lingrui Mei
Ruibin Yuan
Lujun Li
Wei Xue
Yike Guo
230
14
0
04 Oct 2024
Aligning LLMs with Individual Preferences via Interaction
International Conference on Computational Linguistics (COLING), 2024
Shujin Wu
May Fung
Cheng Qian
Jeonghwan Kim
Dilek Z. Hakkani-Tür
Heng Ji
345
52
0
04 Oct 2024
Output Scouting: Auditing Large Language Models for Catastrophic Responses
Andrew Bell
Joao Fonseca
KELM
323
2
0
04 Oct 2024
A Probabilistic Perspective on Unlearning and Alignment for Large Language Models
International Conference on Learning Representations (ICLR), 2024
Yan Scholten
Stephan Günnemann
Leo Schwinn
MU
744
15
0
04 Oct 2024
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
International Conference on Learning Representations (ICLR), 2024
Xinpeng Wang
Chengzhi Hu
Paul Röttger
Barbara Plank
442
24
0
04 Oct 2024
HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
Lingrui Mei
Shenghua Liu
Yiwei Wang
Baolong Bi
Ruibin Yuan
Xueqi Cheng
257
10
0
03 Oct 2024
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
International Conference on Learning Representations (ICLR), 2024
Xiaogeng Liu
Peiran Li
Edward Suh
Yevgeniy Vorobeychik
Zhuoqing Mao
Somesh Jha
Patrick McDaniel
Huan Sun
Bo Li
Chaowei Xiao
519
102
0
03 Oct 2024
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
International Conference on Learning Representations (ICLR), 2024
Guobin Shen
Dongcheng Zhao
Yiting Dong
Xiang He
Yi Zeng
AAML
345
11
0
03 Oct 2024
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
Maya Pavlova
Erik Brinkman
Krithika Iyer
Vítor Albiero
Joanna Bitton
Hailey Nguyen
Haibin Zhang
Cristian Canton Ferrer
Ivan Evtimov
Aaron Grattafiori
ALM
233
29
0
02 Oct 2024
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
International Conference on Learning Representations (ICLR), 2024
Seanie Lee
Haebin Seong
Dong Bok Lee
Minki Kang
Xiaoyin Chen
Dominik Wagner
Yoshua Bengio
Juho Lee
Sung Ju Hwang
405
13
0
02 Oct 2024
Endless Jailbreaks with Bijection Learning
International Conference on Learning Representations (ICLR), 2024
Brian R. Y. Huang
Maximilian Li
Leonard Tang
AAML
382
14
0
02 Oct 2024
Robust LLM safeguarding via refusal feature adversarial training
International Conference on Learning Representations (ICLR), 2024
L. Yu
Virginie Do
Karen Hambardzumyan
Nicola Cancedda
AAML
356
40
0
30 Sep 2024
T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition
Neural Information Processing Systems (NeurIPS), 2024
Chen Yeh
You-Ming Chang
Wei-Chen Chiu
Ning Yu
189
3
0
29 Sep 2024
GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks
Rongchang Li
Minjie Chen
Wenpeng Xing
Han Chen
Wenpeng Xing
Meng Han
SILM, ELM
153
7
0
29 Sep 2024
Overriding Safety protections of Open-source Models
Sachin Kumar
120
5
0
28 Sep 2024
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Tiansheng Huang
Sihao Hu
Fatih Ilhan
Selim Furkan Tekin
Ling Liu
AAML
485
76
0
26 Sep 2024
Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Jinchuan Zhang
Yan Zhou
Yaxin Liu
Ziming Li
Songlin Hu
AAML
229
16
0
25 Sep 2024
Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI
Ambrish Rawat
Stefan Schoepf
Giulio Zizzo
Giandomenico Cornacchia
Muhammad Zaid Hameed
...
Elizabeth M. Daly
Mark Purcell
P. Sattigeri
Pin-Yu Chen
Kush R. Varshney
AAML
221
14
0
23 Sep 2024
Backtracking Improves Generation Safety
Yiming Zhang
Jianfeng Chi
Hailey Nguyen
Kartikeya Upasani
Daniel M. Bikel
Jason Weston
Eric Michael Smith
SILM
313
24
0
22 Sep 2024
Jailbreaking Large Language Models with Symbolic Mathematics
Emet Bethany
Mazal Bethany
Juan Arturo Nolazco Flores
S. Jha
Peyman Najafirad
AAML
208
10
0
17 Sep 2024
Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
BigData Congress [Services Society] (BSS), 2024
Md Zarif Hossain
Ahmed Imteaj
AAML, VLM
261
13
0
11 Sep 2024
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui
Yishi Xu
Zhewei Huang
Shuchang Zhou
Jianbin Jiao
Junge Zhang
PILM, AAML
351
9
0
05 Sep 2024
Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An
Sicheng Zhu
Ruiyi Zhang
Michael-Andrei Panaitescu-Liess
Yuancheng Xu
Furong Huang
AAML
391
29
0
01 Sep 2024
Legilimens: Practical and Unified Content Moderation for Large Language Model Services
Conference on Computer and Communications Security (CCS), 2024
Jialin Wu
Jiangyi Deng
Shengyuan Pang
Yanjiao Chen
Jiayang Xu
Xinfeng Li
Wei Dong
361
13
0
28 Aug 2024
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li
Ziwen Han
Ian Steneker
Willow Primack
Riley Goodside
Hugh Zhang
Zifan Wang
Cristina Menghini
Summer Yue
AAML, MU
291
106
0
27 Aug 2024
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Hongfu Liu
Yuxi Xie
Ye Wang
Michael Shieh
230
8
0
27 Aug 2024
Page 8 of 10