Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
arXiv:2501.18837 · 31 January 2025
Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore R. Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
Links: ArXiv (abs) · PDF · HTML · HuggingFace (10 upvotes)
Papers citing "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming" (50 of 63 papers shown; page 1 of 2)
- Invasive Context Engineering to Control Large Language Models · Thomas Rivasseau · 02 Dec 2025
- Password-Activated Shutdown Protocols for Misaligned Frontier Agents · Kai Williams, Rohan Subramani, Francis Rhys Ward · 29 Nov 2025
- The Impact of Off-Policy Training Data on Probe Generalisation · Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov · 21 Nov 2025
- Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges · Hamin Koo, Minseon Kim, Jaehyung Kim · 03 Nov 2025
- The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning · Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok Yan Lam · SILM · 24 Oct 2025
- Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks · Mahavir Dabas, Tran Ngoc Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, ..., Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, R. Jia · AAML · 24 Oct 2025
- ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases · Ziqian Zhong, Aditi Raghunathan, Nicholas Carlini · 23 Oct 2025
- Agentic Reinforcement Learning for Search is Unsafe · Yushi Yang, Shreyansh Padarha, Andrew Lee, Adam Mahdi · LRM · 20 Oct 2025
- CourtGuard: A Local, Multiagent Prompt Injection Classifier · Isaac Wu, Michael Maslowski · LLMAG, AAML, SILM · 20 Oct 2025
- Qwen3Guard Technical Report · H. Vicky Zhao, C. Yuan, Fei Huang, X. S. Hu, Yichang Zhang, ..., Y. Li, Yi Zhang, Yong Jiang, Yu Wan, Y. Zhou · 16 Oct 2025
- Don't Walk the Line: Boundary Guidance for Filtered Generation · Sarah Ball, Andreas Haupt · 13 Oct 2025
- All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language · Shiyuan Guo, Henry Sleight, Fabien Roger · ELM, LRM · 10 Oct 2025
- Incremental Hybrid Ensemble with Graph Attention and Frequency-Domain Features for Stable Long-Term Credit Risk Modeling · Jiajing Wang · 09 Oct 2025
- The Alignment Waltz: Jointly Training Agents to Collaborate for Safety · Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan · 09 Oct 2025
- Agentic Misalignment: How LLMs Could Be Insider Threats · Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart Ritchie, Sören Mindermann, Ethan Perez, Kevin K. Troy, Evan Hubinger · 05 Oct 2025
- Bypassing Prompt Guards in Production with Controlled-Release Prompting · Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang · SILM, AAML · 02 Oct 2025
- Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed · Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Katja Filippova, Sanmi Koyejo · OOD, AAML · 01 Oct 2025
- Large-Scale Constraint Generation - Can LLMs Parse Hundreds of Constraints? · Matteo Boffa, Jiaxuan You · 28 Sep 2025
- D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models · Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, Spyros Matsoukas · AAML, ELM, LRM · 22 Sep 2025
- Evaluating the Robustness of Retrieval-Augmented Generation to Adversarial Evidence in the Health Domain · Shakiba Amirshahi, Amin Bigdeli, Charles L. A. Clarke, Amira Ghenai · AAML · 04 Sep 2025
- PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming · Wesley Hanwen Deng, Sunnie S. Y. Kim, Akshita Jha, Ken Holstein, Motahhare Eslami, Lauren Wilcox, Leon A Gatys · 03 Sep 2025
- CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention · Xiaomeng Hu, Fei Huang, Chenhan Yuan, Junyang Lin, Tsung-Yi Ho · 01 Sep 2025
- Evaluating Language Model Reasoning about Confidential Information · Dylan Sam, Alexander Robey, Andy Zou, Matt Fredrikson, J. Zico Kolter · ELM, LRM · 27 Aug 2025
- Real-Time Detection of Hallucinated Entities in Long-Form Generation · Oscar Obeso, Andy Arditi, Javier Ferrando, Joshua Freeman, Cameron Holmes, Neel Nanda · HILM · 26 Aug 2025
- Involuntary Jailbreak: On Self-Prompting Attacks · Yangyang Guo, Yangyan Li, Mohan Kankanhalli · 18 Aug 2025
- Amazon Nova AI Challenge -- Trusted AI: Advancing secure, AI-assisted software development · Sattvik Sahai, Prasoon Goyal, Michael Johnston, Anna Gottardi, Yao Lu, ..., Lavina Vaz, Leslie Ball, Maureen Murray, Rahul Gupta, Shankar Ananthakrishna · 13 Aug 2025
- Multi-Turn Jailbreaks Are Simpler Than They Seem · Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz · AAML, MU · 11 Aug 2025
- Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity · Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao · AAML · 11 Aug 2025
- A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection · Ivan Zhang · AAML · 10 Aug 2025
- PurpCode: Reasoning for Safer Code Generation · Jiawei Liu, Nirav Diwan, Zhe Wang, Haoyu Zhai, Xiaona Zhou, ..., Hadjer Benkraouda, Yuxiang Wei, Lingming Zhang, Ismini Lourentzou, Gang Wang · SILM, LRM, ELM · 25 Jul 2025
- Combining Cost-Constrained Runtime Monitors for AI Safety · Tim Tian Hua, James Baskerville, Henri Lemoine, Mia Hopman, Aryan Bhatt, Tyler Tracy · 19 Jul 2025
- Attention-Aware GNN-based Input Defense against Multi-Turn LLM Jailbreak · Zixuan Huang, Kecheng Huang, Lihao Yin, Bowei He, Huiling Zhen, Mingxuan Yuan, Zili Shao · AAML · 09 Jul 2025
- Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm · Baixiang Huang, Zhen Tan, Haoran Wang, Zijie Liu, Dawei Li, Ali Payani, Huan Liu, Tianlong Chen, Kai Shu · KELM, LLMSV · 25 Jun 2025
- RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? · Rohan Gupta, Erik Jenner · 17 Jun 2025
- FORTRESS: Frontier Risk Evaluation for National Security and Public Safety · Christina Q. Knight, Kaustubh Deshpande, Ved Sirdeshmukh, Meher Mankikar, Scale Red Team, SEAL Research Team, Julian Michael · AAML, ELM · 17 Jun 2025
- Jailbreak Transferability Emerges from Shared Representations · Rico Angell, Jannik Brinkmann, He He · 15 Jun 2025
- Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors · Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He · AAML · 12 Jun 2025
- Detecting High-Stakes Interactions with Activation Probes · Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David M. Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov · 12 Jun 2025
- Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs · Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara · 11 Jun 2025
- VerIF: Verification Engineering for Reinforcement Learning in Instruction Following · Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li · OffRL · 11 Jun 2025
- From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring · Yang Li, Qiang Sheng, Yehan Yang, Xueyao Zhang, Juan Cao · 11 Jun 2025
- Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values · Nell Watson, Ahmed Amer, Evan Harris, Preeti Ravindra, Shujun Zhang · 08 Jun 2025
- Benchmarking Misuse Mitigation Against Covert Adversaries · Davis Brown, Mahdi Sabbaghi, Luze Sun, Avi Schwarzschild, George Pappas, Eric Wong, Hamed Hassani · 06 Jun 2025
- Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models · Bumjin Park, Jinsil Lee, Jaesik Choi · Annual Meeting of the Association for Computational Linguistics (ACL), 2025 · 01 Jun 2025
- A Red Teaming Roadmap Towards System-Level Safety · Zifan Wang, Christina Q. Knight, Jeremy Kritz, Willow Primack, Julian Michael · AAML · 30 May 2025
- Learning Safety Constraints for Large Language Models · Xin Chen, Yarden As, Andreas Krause · 30 May 2025
- Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations · Sanjay Kariyappa, G. E. Suh · 25 May 2025
- An Example Safety Case for Safeguards Against Misuse · Joshua Clymer, Jonah Weinbaum, Robert Kirk, Kimberly Mai, Selena Zhang, Xander Davies · 23 May 2025
- MixAT: Combining Continuous and Discrete Adversarial Training for LLMs · Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev · AAML · 22 May 2025
- Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas · Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger · 20 May 2025