Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
arXiv:2501.18837 · 31 January 2025
Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore R. Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
Links: ArXiv (abs) · PDF · HTML · HuggingFace (10 upvotes)
Papers citing "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming" (50 of 63 papers shown; page 1 of 2)
- Invasive Context Engineering to Control Large Language Models · Thomas Rivasseau · 02 Dec 2025
- Password-Activated Shutdown Protocols for Misaligned Frontier Agents · Kai Williams, Rohan Subramani, Francis Rhys Ward · 29 Nov 2025
- The Impact of Off-Policy Training Data on Probe Generalisation · Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov · 21 Nov 2025
- Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges · Hamin Koo, Minseon Kim, Jaehyung Kim · 03 Nov 2025
- The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning · Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok Yan Lam · SILM · 24 Oct 2025
- Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks · Mahavir Dabas, Tran Ngoc Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, ..., Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, R. Jia · AAML · 24 Oct 2025
- ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases · Ziqian Zhong, Aditi Raghunathan, Nicholas Carlini · 23 Oct 2025
- Agentic Reinforcement Learning for Search is Unsafe · Yushi Yang, Shreyansh Padarha, Andrew Lee, Adam Mahdi · LRM · 20 Oct 2025
- CourtGuard: A Local, Multiagent Prompt Injection Classifier · Isaac Wu, Michael Maslowski · LLMAG, AAML, SILM · 20 Oct 2025
- Qwen3Guard Technical Report · H. Vicky Zhao, C. Yuan, Fei Huang, X. S. Hu, Yichang Zhang, ..., Y. Li, Yi Zhang, Yong Jiang, Yu Wan, Y. Zhou · 16 Oct 2025
- Don't Walk the Line: Boundary Guidance for Filtered Generation · Sarah Ball, Andreas Haupt · 13 Oct 2025
- All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language · Shiyuan Guo, Henry Sleight, Fabien Roger · ELM, LRM · 10 Oct 2025
- Incremental Hybrid Ensemble with Graph Attention and Frequency-Domain Features for Stable Long-Term Credit Risk Modeling · Jiajing Wang · 09 Oct 2025
- The Alignment Waltz: Jointly Training Agents to Collaborate for Safety · Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan · 09 Oct 2025
- Agentic Misalignment: How LLMs Could Be Insider Threats · Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart Ritchie, Sören Mindermann, Ethan Perez, Kevin K. Troy, Evan Hubinger · 05 Oct 2025
- Bypassing Prompt Guards in Production with Controlled-Release Prompting · Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang · SILM, AAML · 02 Oct 2025
- Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed · Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Katja Filippova, Sanmi Koyejo · OOD, AAML · 01 Oct 2025
- Large-Scale Constraint Generation - Can LLMs Parse Hundreds of Constraints? · Matteo Boffa, Jiaxuan You · 28 Sep 2025
- D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models · Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, Spyros Matsoukas · AAML, ELM, LRM · 22 Sep 2025
- Evaluating the Robustness of Retrieval-Augmented Generation to Adversarial Evidence in the Health Domain · Shakiba Amirshahi, Amin Bigdeli, Charles L. A. Clarke, Amira Ghenai · AAML · 04 Sep 2025
- PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming · Wesley Hanwen Deng, Sunnie S. Y. Kim, Akshita Jha, Ken Holstein, Motahhare Eslami, Lauren Wilcox, Leon A Gatys · 03 Sep 2025
- CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention · Xiaomeng Hu, Fei Huang, Chenhan Yuan, Junyang Lin, Tsung-Yi Ho · 01 Sep 2025
- Evaluating Language Model Reasoning about Confidential Information · Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter · ELM, LRM · 27 Aug 2025
- Real-Time Detection of Hallucinated Entities in Long-Form Generation · Oscar Obeso, Andy Arditi, Javier Ferrando, Joshua Freeman, Cameron Holmes, Neel Nanda · HILM · 26 Aug 2025
- Involuntary Jailbreak: On Self-Prompting Attacks · Yangyang Guo, Yangyan Li, Mohan Kankanhalli · 18 Aug 2025
- Amazon Nova AI Challenge -- Trusted AI: Advancing secure, AI-assisted software development · Sattvik Sahai, Prasoon Goyal, Michael Johnston, Anna Gottardi, Yao Lu, ..., Lavina Vaz, Leslie Ball, Maureen Murray, Rahul Gupta, Shankar Ananthakrishna · 13 Aug 2025
- Multi-Turn Jailbreaks Are Simpler Than They Seem · Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz · AAML, MU · 11 Aug 2025
- Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity · Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao · AAML · 11 Aug 2025
- A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection · Ivan Zhang · AAML · 10 Aug 2025
- PurpCode: Reasoning for Safer Code Generation · Jiawei Liu, Nirav Diwan, Zhe Wang, Haoyu Zhai, Xiaona Zhou, ..., Hadjer Benkraouda, Yuxiang Wei, Lingming Zhang, Ismini Lourentzou, Gang Wang · SILM, LRM, ELM · 25 Jul 2025
- Combining Cost-Constrained Runtime Monitors for AI Safety · Tim Tian Hua, James Baskerville, Henri Lemoine, Mia Hopman, Aryan Bhatt, Tyler Tracy · 19 Jul 2025
- Attention-Aware GNN-based Input Defense against Multi-Turn LLM Jailbreak · Zixuan Huang, Kecheng Huang, Lihao Yin, Bowei He, Huiling Zhen, Mingxuan Yuan, Zili Shao · AAML · 09 Jul 2025
- Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm · Baixiang Huang, Zhen Tan, Haoran Wang, Zijie Liu, Dawei Li, Ali Payani, Huan Liu, Tianlong Chen, Kai Shu · KELM, LLMSV · 25 Jun 2025
- RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? · Rohan Gupta, Erik Jenner · 17 Jun 2025
- FORTRESS: Frontier Risk Evaluation for National Security and Public Safety · Christina Q. Knight, Kaustubh Deshpande, Ved Sirdeshmukh, Meher Mankikar, Scale Red Team, SEAL Research Team, Julian Michael · AAML, ELM · 17 Jun 2025
- Jailbreak Transferability Emerges from Shared Representations · Rico Angell, Jannik Brinkmann, He He · 15 Jun 2025
- Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors · Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He · AAML · 12 Jun 2025
- Detecting High-Stakes Interactions with Activation Probes · Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David M. Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov · 12 Jun 2025
- Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs · Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara · 11 Jun 2025
- VerIF: Verification Engineering for Reinforcement Learning in Instruction Following · Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li · OffRL · 11 Jun 2025
- From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring · Yang Li, Qiang Sheng, Yehan Yang, Xueyao Zhang, Juan Cao · 11 Jun 2025
- Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values · Nell Watson, Ahmed Amer, Evan Harris, Preeti Ravindra, Shujun Zhang · 08 Jun 2025
- Benchmarking Misuse Mitigation Against Covert Adversaries · Davis Brown, Mahdi Sabbaghi, Luze Sun, Avi Schwarzschild, George Pappas, Eric Wong, Hamed Hassani · 06 Jun 2025
- Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models · Bumjin Park, Jinsil Lee, Jaesik Choi · Annual Meeting of the Association for Computational Linguistics (ACL), 2025 · 01 Jun 2025
- A Red Teaming Roadmap Towards System-Level Safety · Zifan Wang, Christina Q. Knight, Jeremy Kritz, Willow Primack, Julian Michael · AAML · 30 May 2025
- Learning Safety Constraints for Large Language Models · Xin Chen, Yarden As, Andreas Krause · 30 May 2025
- Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations · Sanjay Kariyappa, G. E. Suh · 25 May 2025
- An Example Safety Case for Safeguards Against Misuse · Joshua Clymer, Jonah Weinbaum, Robert Kirk, Kimberly Mai, Selena Zhang, Xander Davies · 23 May 2025
- MixAT: Combining Continuous and Discrete Adversarial Training for LLMs · Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev · AAML · 22 May 2025
- Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas · Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger · 20 May 2025