132
v1v2 (latest)

Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Main:10 Pages
7 Figures
Bibliography:4 Pages
9 Tables
Appendix:5 Pages
Abstract

Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcomes of our experiments are fourfold. First, we developed a comprehensive hierarchical taxonomy of jailbreak strategies that systematically consolidates techniques previously studied in isolation and harmonizes existing, partially overlapping classifications with explicit cross-references to prior categorizations. The taxonomy organizes jailbreak strategies into seven mechanism-oriented families: impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmarked GPT-5 as a judge for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.

View on arXiv
Comments on this paper