Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

Neural Information Processing Systems (NeurIPS), 2023

4 December 2023

Papers citing "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically"

50 / 167 papers shown

RAJ-PGA: Reasoning-Activated Jailbreak and Principle-Guided Alignment Framework for Large Reasoning Models

Xiaochun Cao

Tieyun Qian

LRM

157

18 Aug 2025

Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position

Zhixin Xie

Xurui Song

Jun Luo

164

17 Aug 2025

Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation

165

14 Aug 2025

Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

133

11 Aug 2025

A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection

Ivan Zhang

AAML

10 Aug 2025

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM

203

07 Aug 2025

ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants

...

149

05 Aug 2025

The SMeL Test: A simple benchmark for media literacy in language models

Gustaf Ahdritz

Anat Kleiman

217

04 Aug 2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

132

31 Jul 2025

Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

156

30 Jul 2025

TRIDENT: Benchmarking LLM Safety in Finance, Medicine, and Law

221

22 Jul 2025

Adversarial Manipulation of Reasoning Models using Internal Representations

130

03 Jul 2025

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

242

03 Jul 2025

PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

248

03 Jul 2025

VERA: Variational Inference Framework for Jailbreaking Large Language Models

353

27 Jun 2025

MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning

214

20 Jun 2025

From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem

329

18 Jun 2025

FORTRESS: Frontier Risk Evaluation for National Security and Public Safety

299

17 Jun 2025

Exploiting AI for Attacks: On the Interplay between Adversarial AI and Offensive AI

IEEE Intelligent Systems (IEEE Intell. Syst.), 2025

143

14 Jun 2025

Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs

Hiroshi Matsuda

Chunpeng Ma

Masayuki Asahara

316

11 Jun 2025

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

369

11 Jun 2025

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

310

09 Jun 2025

Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures

147

09 Jun 2025

AdvSumm: Adversarial Training for Bias Mitigation in Text Summarization

Mukur Gupta

Nikhil Reddy Varimalla

Nicholas Deas

Melanie Subbiah

Kathleen McKeown

278

06 Jun 2025

OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities

501

29 May 2025

LLM Agents Should Employ Security Principles

294

29 May 2025

Permissioned LLMs: Enforcing Access Control in Large Language Models

Krishnaram Kenthapadi

310

28 May 2025

Jailbreak Distillation: Renewable Safety Benchmarking

227

28 May 2025

PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing

361

27 May 2025

Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts

173

26 May 2025

PAM: Training Policy-Aligned Moderation Filters at Scale

Linguistics Vanguard (LV), 2024

260

26 May 2025

Security Concerns for Large Language Models: A Survey

Miles Q. Li

Benjamin C. M. Fung

PILM ELM

770

24 May 2025

Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

266

24 May 2025

An Example Safety Case for Safeguards Against Misuse

173

23 May 2025

Chain-of-Lure: A Universal Jailbreak Attack Framework using Unconstrained Synthetic Narratives

281

23 May 2025

One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs

International Conference on Learning Representations (ICLR), 2025

297

23 May 2025

MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

303

22 May 2025

Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers

376

22 May 2025

Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses

Xiaoxue Yang

Bozhidar Stevanoski

Matthieu Meeus

Yves-Alexandre de Montjoye

AAML

306

21 May 2025

SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

711

20 May 2025

AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models

437

20 May 2025

One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models

302

12 May 2025

OET: Optimization-based prompt injection Evaluation Toolkit

332

01 May 2025

Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression

265

29 Apr 2025

JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift

316

28 Apr 2025

Prompt Injection Attack to Tool Selection in LLM Agents

405

28 Apr 2025

A Cryptographic Perspective on Mitigation vs. Detection in Machine Learning

Greg Gluch

Shafi Goldwasser

AAML

455

28 Apr 2025

Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs

Mohammad Akbar-Tajari

Mohammad Taher Pilehvar

Mohammad Mahmoody

AAML

207

26 Apr 2025

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

435

22 Apr 2025

RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search

233

21 Apr 2025
