Jailbroken: How Does LLM Safety Training Fail?

Neural Information Processing Systems (NeurIPS), 2023
5 July 2023
Alexander Wei, Nika Haghtalab, Jacob Steinhardt
Links: arXiv (abs) · PDF · HTML · HuggingFace (13 upvotes) · GitHub

Papers citing "Jailbroken: How Does LLM Safety Training Fail?"

Showing 50 of 882 citing papers (page 1 of 18)

SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security
Wei Zhao, Zhe Li, Jun Sun
AAML · 04 Dec 2025

Invasive Context Engineering to Control Large Language Models
Thomas Rivasseau
02 Dec 2025

TradeTrap: Are LLM-based Trading Agents Truly Reliable and Faithful?
Lewen Yan, Jilin Mei, Tianyi Zhou, Lige Huang, Jie Zhang, Dongrui Liu, Jing Shao
AAML, AIFin · 01 Dec 2025

Red Teaming Large Reasoning Models
Jiawei Chen, Y. Yang, Chao Yu, Yu Tian, Zhi Cao, Linghao Li, Hang Su, Z. Yin, Zhaoxia Yin
HILM, KELM, LRM, ELM · 29 Nov 2025

Are LLMs Good Safety Agents or a Propaganda Engine?
Neemesh Yadav, Francesco Ortu, Jiarui Liu, Joeun Yook, Bernhard Schölkopf, Rada Mihalcea, Alberto Cazzaniga, Zhijing Jin
28 Nov 2025

Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks
Richard J. Young
ELM · 27 Nov 2025

Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines
Yuhang Wang, Yanxu Zhu, Dongyuan Lu, Jitao Sang
AAML, SILM, ELM, LRM · 26 Nov 2025

FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models
M. Fatehkia, Enes Altinisik, Husrev Taha Sencar
24 Nov 2025

Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs
Andrew Maranhão Ventura Dáddario
AAML · 24 Nov 2025

Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion
Yu Cui, Yifei Liu, Hang Fu, Sicheng Pan, Haibin Zhang, Cong Zuo, Licheng Wang
24 Nov 2025

Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
Junbo Zhang, Ran Chen, Qianli Zhou, Xinyang Deng, Wen Jiang
24 Nov 2025

Representational and Behavioral Stability of Truth in Large Language Models
Samantha Dies, Courtney Maynard, Germans Savcisens, Tina Eliassi-Rad
HILM · 24 Nov 2025

TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization
Yanting Wang, Runpeng Geng, Jinghui Chen, Minhao Cheng, Jinyuan Jia
23 Nov 2025

Beyond Jailbreak: Unveiling Risks in LLM Applications Arising from Blurred Capability Boundaries
Y. Zhang, Shibo Cui, Baojun Liu, Jingkai Yu, Min Zhang, Fan Shi, Han Zheng
ELM · 22 Nov 2025

The Impact of Off-Policy Training Data on Probe Generalisation
Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov
21 Nov 2025

Steering in the Shadows: Causal Amplification for Activation Space Attacks in Large Language Models
Zhiyuan Xu, Stanislav Abaimov, Joseph Gardiner, Sana Belguith
LLMSV · 21 Nov 2025

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga E. Sorokoletova, Federico Sartore, Daniele Nardi
AAML · 19 Nov 2025

When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers
Zhaoxin Zhang, Borui Chen, Yiming Hu, Youyang Qu, Tianqing Zhu, Longxiang Gao
19 Nov 2025

Entropy-Based Measurement of Value Drift and Alignment Work in Large Language Models
Samih Fadli
19 Nov 2025

N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator
Zheyu Lin, Jirui Yang, Hengqi Guo, Yubing Bao, Yao Guan
18 Nov 2025

LLM Reinforcement in Context
Thomas Rivasseau
16 Nov 2025

GRAPHTEXTACK: A Realistic Black-Box Node Injection Attack on LLM-Enhanced GNNs
Jiaji Ma, Puja Trivedi, Danai Koutra
16 Nov 2025

Generalized-Scale Object Counting with Gradual Query Aggregation
Jer Pelhan, A. Lukežič, Matej Kristan
ObjD · 11 Nov 2025

EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers
Yilin Jiang, Mingzi Zhang, Xuanyu Yin, Sheng Jin, Suyu Lu, Zuocan Ying, Zengyi Yu, Xiangjie Kong
ELM · 10 Nov 2025

KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs
Shuyuan Liu, Jiawei Chen, Xiao Yang, Hang Su, Z. Yin
AAML · 09 Nov 2025

Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity
Austin Meek, Eitan Sprejer, Iván Arcuschin, A. Brockmeier, Steven Basart
LRM · 31 Oct 2025

Prevalence of Security and Privacy Risk-Inducing Usage of AI-based Conversational Agents
Kathrin Grosse, Nico Ebert
SILM · 31 Oct 2025

Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar
LRM · 30 Oct 2025

Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token
Shaked Zychlinski, Yuval Kainan
30 Oct 2025

The Narrative Continuity Test: A Conceptual Framework for Evaluating Identity Persistence in AI Systems
Stefano Natangelo
28 Oct 2025

The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning
Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok Yan Lam
SILM · 24 Oct 2025

Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency
Yukun Jiang, Mingjie Li, Michael Backes, Yang Zhang
24 Oct 2025

Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models
Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, Zico Kolter, Andrej Risteski
AAML · 24 Oct 2025

FlexiDataGen: An Adaptive LLM Framework for Dynamic Semantic Dataset Generation in Sensitive Domains
Hamed Jelodar, Samita Bai, Roozbeh Razavi-Far, Ali Ghorbani
21 Oct 2025

Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting
Chenchen Tan, Youyang Qu, X. Li, Hui Zhang, Shujie Cui, Cunjian Chen, Longxiang Gao
MU, KELM · 20 Oct 2025

Agentic Reinforcement Learning for Search is Unsafe
Yushi Yang, Shreyansh Padarha, Andrew Lee, Adam Mahdi
LRM · 20 Oct 2025

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization
Masahiro Kaneko, Zeerak Talat, Timothy Baldwin
AAML · 19 Oct 2025

Black-box Optimization of LLM Outputs by Asking for Directions
Jie Zhang, Meng Ding, Yang Liu, Jue Hong, F. Tramèr
AAML · 19 Oct 2025

Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning
Bingqi Shang, Yiwei Chen, Yihua Zhang, Bingquan Shen, Sijia Liu
MU, KELM, AAML · 19 Oct 2025

Toward Understanding Security Issues in the Model Context Protocol Ecosystem
Xiaofan Li, Xing Gao
18 Oct 2025

SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
Hanbin Hong, Shuya Feng, Nima Naderloui, Shenao Yan, Jingyu Zhang, Biying Liu, Ali Arastehfard, Heqing Huang, Yuan Hong
AAML · 17 Oct 2025

When Flatness Does (Not) Guarantee Adversarial Robustness
Nils Philipp Walter, Linara Adilova, Jilles Vreeken, Michael Kamp
16 Oct 2025

Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems
Edoardo Allegrini, Ananth Shreekumar, Z. Berkay Celik
15 Oct 2025

RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
Tuan T. Nguyen, John Le, Thai T. Vu, Willy Susilo, Heath Cooper
14 Oct 2025

Don't Walk the Line: Boundary Guidance for Filtered Generation
Sarah Ball, Andreas Haupt
13 Oct 2025

BlackIce: A Containerized Red Teaming Toolkit for AI Security Testing
Caelin Kaplan, Alexander Warnecke, Neil Archibald
VLM · 13 Oct 2025

Merlin's Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting
Heming Xia, Cunxiao Du, Rui Li, Chak Tou Leong, Yongqi Li, Wenjie Li
LLMAG, AAML, LRM · 12 Oct 2025

ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
Journal of Network and Computer Applications (JNCA), 2025
Guan-Yan Yang, Tzu-Yu Cheng, Ya-Wen Teng, Farn Wanga, Kuo-Hui Yeh
11 Oct 2025

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander Schulhoff, Jamie Hayes, ..., Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Seth Neel, F. Tramèr
AAML, ELM · 10 Oct 2025

A geometrical approach to solve the proximity of a point to an axisymmetric quadric in space
Bibekananda Patra, Aditya Mahesh Kolte, Sandipan Bandyopadhyay
10 Oct 2025