Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses

3 June 2024

ArXiv (abs)PDF HTML HuggingFace (1 upvotes)Github (61★)

Papers citing "Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses"

44 / 44 papers shown

Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization

452

24 Nov 2025

SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models

230

17 Oct 2025

A geometrical approach to solve the proximity of a point to an axisymmetric quadric in space

Bibekananda Patra

Aditya Mahesh Kolte

Sandipan Bandyopadhyay

122

10 Oct 2025

Imperceptible Jailbreaking against Large Language Models

130

06 Oct 2025

NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks

204

03 Oct 2025

SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models

123

30 Sep 2025

SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks

103

21 Aug 2025

Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation

193

19 Aug 2025

Mitigating Jailbreaks with Intent-Aware LLMs

141

16 Aug 2025

Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation

186

14 Aug 2025

Many-Turn Jailbreaking

150

09 Aug 2025

MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs

223

27 Jun 2025

Lifelong Safety Alignment for Language Models

292

26 May 2025

What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMsAnnual Meeting of the Association for Computational Linguistics (ACL), 2025

231

26 May 2025

Chain-of-Lure: A Universal Jailbreak Attack Framework using Unconstrained Synthetic Narratives

284

23 May 2025

MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

304

22 May 2025

"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs

Darpan Aswal

Siddharth D Jaiswal

AAML

235

20 May 2025

Multilingual Collaborative Defense for Large Language Models

387

17 May 2025

PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context OptimizationAnnual Meeting of the Association for Computational Linguistics (ACL), 2025

505

15 May 2025

RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language ModelsNorth American Chapter of the Association for Computational Linguistics (NAACL), 2025

Bang An

Shiyue Zhang

Mark Dredze

435

25 Apr 2025

Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models

Tri Nguyen

Lohith Srikanth Pentapalli

...

281

21 Apr 2025

Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses

Zhengchun Shang

Wenlan Wei

Weiheng Bai

AAML

475

02 Apr 2025

Exploiting Instruction-Following Retrievers for Malicious Information RetrievalAnnual Meeting of the Association for Computational Linguistics (ACL), 2025

Parishad BehnamGhader

Nicholas Meade

Siva Reddy

299

11 Mar 2025

MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming

Giandomenico Cornacchia

Mark Purcell

196

08 Mar 2025

GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods

393

24 Feb 2025

SQL Injection Jailbreak: A Structural Disaster of Large Language ModelsAnnual Meeting of the Association for Computational Linguistics (ACL), 2024

609

03 Nov 2024

Plentiful Jailbreaks with String Compositions

Brian R. Y. Huang

AAML

393

01 Nov 2024

Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs

Muhammad Abdul-Mageed

Shady Shehata

275

31 Oct 2024

JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework

321

11 Oct 2024

Instructional Segment Embedding: Improving LLM Safety with Instruction HierarchyInternational Conference on Learning Representations (ICLR), 2024

406

09 Oct 2024

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win RatesInternational Conference on Learning Representations (ICLR), 2024

Qian Liu

291

09 Oct 2024

FlipAttack: Jailbreak LLMs via Flipping

Yue Liu

Miao Xiong

Bryan Hooi

244

02 Oct 2024

AI Safety in Generative AI Large Language Models: A Survey

Lina Yao

361

06 Jul 2024

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Zhen Sun

Qi Li

344

198

05 Jul 2024

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Zhao Xu

Fan Liu

Hao Liu

AAML

274

13 Jun 2024

SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner

Xunguang Wang

Shuai Wang

Yingjiu Li

Yang Liu

Ning Liu

Juergen Rahmel

AAML

481

08 Jun 2024

Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs

Fan Liu

Zhao Xu

Hao Liu

AAML

258

07 Jun 2024

Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks

366

30 May 2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive AttacksInternational Conference on Learning Representations (ICLR), 2024

Maksym Andriushchenko

Francesco Croce

Nicolas Flammarion

AAML

793

374

02 Apr 2024

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Patrick Chao

Edoardo Debenedetti

Avi Schwarzschild

Maksym Andriushchenko

...

George J. Pappas

472

291

28 Mar 2024

Defending Jailbreak Prompts via In-Context Adversarial Game

492

20 Feb 2024

Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks

196

14 Feb 2024

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

George J. Pappas

581

394

05 Oct 2023

Certifying LLM Safety against Adversarial Prompting

Himabindu Lakkaraju

714

273

06 Sep 2023