Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Neural Information Processing Systems (NeurIPS), 2023. 4 December 2023.
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi.

Papers citing "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically"

50 of 167 citing papers shown.
The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
Kristina Nikolić, Luze Sun, Jie Zhang, F. Tramèr. 14 Apr 2025.
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge
Riccardo Cantini, A. Orsino, Massimo Ruggiero, Domenico Talia. Machine-mediated learning (ML), 2025. 10 Apr 2025. [AAML, ELM]
NLP Security and Ethics, in the Wild
Heather Lent, Erick Galinkin, Yiyi Chen, Jens Myrup Pedersen, Leon Derczynski, Johannes Bjerva. Transactions of the Association for Computational Linguistics (TACL), 2025. 09 Apr 2025. [SILM]
Separator Injection Attack: Uncovering Dialogue Biases in Large Language Models Caused by Role Separators
Xitao Li, Jian Shu, Jiang Wu, Ting Liu. 08 Apr 2025. [AAML]
Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models
Jiawei Lian, Jianhong Pan, L. Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau. 07 Apr 2025. [AAML]
Multi-Agent Systems Execute Arbitrary Malicious Code
Harold Triedman, Rishi Jha, Vitaly Shmatikov. 15 Mar 2025. [LLMAG, AAML]
Safe Vision-Language Models via Unsafe Weights Manipulation
Moreno D'Incà, E. Peruzzo, Xingqian Xu, Humphrey Shi, Andrii Zadaianchuk, Goran Frehse. 14 Mar 2025. [MU]
MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming
Stefan Schoepf, Muhammad Zaid Hameed, Ambrish Rawat, Kieran Fraser, Giulio Zizzo, Giandomenico Cornacchia, Mark Purcell. 08 Mar 2025.
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger, Boussad Addad, Katarzyna Kapusta. 08 Mar 2025. [AAML]
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Hanjiang Hu, Alexander Robey, Changliu Liu. 28 Feb 2025. [AAML, LLMSV]
Shh, don't say that! Domain Certification in LLMs
Cornelius Emde, Alasdair Paren, Preetham Arvind, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Guohao Li, Juil Sock, Adel Bibi. International Conference on Learning Representations (ICLR), 2025. 26 Feb 2025.
REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective
Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann. 24 Feb 2025. [AAML]
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs
Giulio Zizzo, Giandomenico Cornacchia, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Beat Buesser, Mark Purcell, Pin-Yu Chen, P. Sattigeri, Kush R. Varshney. 24 Feb 2025. [AAML]
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang. Annual Meeting of the Association for Computational Linguistics (ACL), 2025. 18 Feb 2025.
Computational Safety for Generative AI: A Signal Processing Perspective
Pin-Yu Chen. 18 Feb 2025.
JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation
Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, Qian Wang. 11 Feb 2025. [AAML]
Confidence Elicitation: A New Attack Vector for Large Language Models
Brian Formento, Chuan-Sheng Foo, See-Kiong Ng. International Conference on Learning Representations (ICLR), 2025. 07 Feb 2025. [AAML]
KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
Buyun Liang, Kwan Ho Ryan Chan, D. Thaker, Jinqi Luo, René Vidal. 05 Feb 2025. [AAML]
Peering Behind the Shield: Guardrail Identification in Large Language Models
Ziqing Yang, Yixin Wu, Rui Wen, Michael Backes, Yang Zhang. 03 Feb 2025.
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang. Neural Information Processing Systems (NeurIPS), 2024. 28 Jan 2025.
Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment
Melissa Kazemi Rad, Huy Nghiem, Andy Luo, Sahil Wadhwa, Mohammad Sorower, Stephen Rawls. 22 Jan 2025. [AAML]
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora. Neural Information Processing Systems (NeurIPS), 2024. 20 Jan 2025. [ALM]
Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity
David Williams-King, Linh Le, Adam Oberman, Yoshua Bengio. 19 Jan 2025. [AAML]
Lessons From Red Teaming 100 Generative AI Products
Blake Bullwinkel, Amanda Minnich, Shiven Chawla, Gary Lopez, Martin Pouliot, ..., Pete Bryan, Ram Shankar Siva Kumar, Yonatan Zunger, Chang Kawaguchi, Mark Russinovich. 13 Jan 2025. [AAML, VLM]
MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue
Fengxiang Wang, Ranjie Duan, Peng Xiao, Yang Liu, Shiji Zhao, ..., Hang Su, Jialing Tao, Hui Xue, Jun Zhu, Hui Xue. 08 Jan 2025. [LLMAG]
LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
Miao Yu, Cunchun Li, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, Qingsong Wen. 03 Jan 2025. [AAML]
Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines
Xiyang Hu. 01 Jan 2025. [AAML]
The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
Xikang Yang, Xuehai Tang, Jizhong Han, Songlin Hu. 18 Nov 2024.
Diversity Helps Jailbreak Large Language Models
Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, Chengzhi Mao. North American Chapter of the Association for Computational Linguistics (NAACL), 2024. 06 Nov 2024. [AAML]
Transferable & Stealthy Ensemble Attacks: A Black-Box Jailbreaking Framework for Large Language Models
Yiqi Yang, Hongye Fu. 31 Oct 2024. [AAML]
Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs
Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Ziqiang Liu, Lirong Qiu, Xi Zhang. International Joint Conference on Artificial Intelligence (IJCAI), 2024. 18 Oct 2024. [AAML]
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
Qizhang Li, Xiaochen Yang, W. Zuo, Yiwen Guo. 15 Oct 2024. [AAML]
Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks
Zi Wang, Divyam Anshumaan, Ashish Hooda, Yudong Chen, Somesh Jha. International Conference on Learning Representations (ICLR), 2024. 05 Oct 2024. [AAML]
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao. International Conference on Learning Representations (ICLR), 2024. 03 Oct 2024.
Endless Jailbreaks with Bijection Learning
Brian R. Y. Huang, Maximilian Li, Leonard Tang. International Conference on Learning Representations (ICLR), 2024. 02 Oct 2024. [AAML]
Multimodal Pragmatic Jailbreak on Text-to-image Models
Tong Liu, Zhixin Lai, Jiawen Wang, Gengyuan Zhang, Shuo Chen, Juil Sock, Vera Demberg, Volker Tresp, Jindong Gu. Annual Meeting of the Association for Computational Linguistics (ACL), 2024. 27 Sep 2024.
PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs
Jiahao Yu, Yangguang Shao, Hanwen Miao, Junzheng Shi. 23 Sep 2024. [SILM, AAML]
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang. 05 Sep 2024. [PILM, AAML]
LLMmap: Fingerprinting For Large Language Models
Dario Pasquini, Evgenios M. Kornaropoulos, G. Ateniese. 22 Jul 2024.
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, Nhathai Phan. 20 Jul 2024.
Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation
Riccardo Cantini, Giada Cosenza, A. Orsino, Domenico Talia. 11 Jul 2024. [AAML]
Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement
Zisu Huang, Xiaohua Wang, Feiran Zhang, Zhibo Xu, Cenyuan Zhang, Qi Qian, Xiaoqing Zheng, Qi Zhang. 01 Jul 2024. [AAML, LRM]
Poisoned LangChain: Jailbreak LLMs by LangChain
Ziqiu Wang, Jun Liu, Shengkai Zhang, Yang Yang. 26 Jun 2024.
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
Lingrui Mei
Shenghua Liu
Yiwei Wang
Baolong Bi
Jiayi Mao
Xueqi Cheng
AAML
208
19
0
17 Jun 2024
Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications
Stephen Burabari Tete. 16 Jun 2024.
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang. 13 Jun 2024. [ELM]
Machine Against the RAG: Jamming Retrieval-Augmented Generation with Blocker Documents
Avital Shafran, R. Schuster, Vitaly Shmatikov. 09 Jun 2024.
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, Juergen Rahmel. 08 Jun 2024. [AAML]
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin. 03 Jun 2024. [AAML]
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
Chen Xiong, Xiangyu Qi, Pin-Yu Chen, Tsung-Yi Ho. 30 May 2024. [AAML]