Jailbroken: How Does LLM Safety Training Fail?

Neural Information Processing Systems (NeurIPS), 2023
5 July 2023
Alexander Wei, Nika Haghtalab, Jacob Steinhardt
Links: arXiv (abs) · PDF · HTML · HuggingFace (13 upvotes) · GitHub

Papers citing "Jailbroken: How Does LLM Safety Training Fail?"

Showing 50 of 882 citing papers (page 1 of 18)

SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security
Wei Zhao, Zhe Li, Jun Sun
AAML · 04 Dec 2025

Invasive Context Engineering to Control Large Language Models
Thomas Rivasseau
02 Dec 2025

TradeTrap: Are LLM-based Trading Agents Truly Reliable and Faithful?
Lewen Yan, Jilin Mei, Tianyi Zhou, Lige Huang, Jie Zhang, Dongrui Liu, Jing Shao
AAML, AIFin · 01 Dec 2025

Red Teaming Large Reasoning Models
Jiawei Chen, Y. Yang, Chao Yu, Yu Tian, Zhi Cao, Linghao Li, Hang Su, Z. Yin, Zhaoxia Yin
HILM, KELM, LRM, ELM · 29 Nov 2025

Are LLMs Good Safety Agents or a Propaganda Engine?
Neemesh Yadav, Francesco Ortu, Jiarui Liu, Joeun Yook, Bernhard Schölkopf, Rada Mihalcea, Alberto Cazzaniga, Zhijing Jin
28 Nov 2025

Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks
Richard J. Young
ELM · 27 Nov 2025

Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines
Yuhang Wang, Yanxu Zhu, Dongyuan Lu, Jitao Sang
AAML, SILM, ELM, LRM · 26 Nov 2025

FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models
M. Fatehkia, Enes Altinisik, Husrev Taha Sencar
24 Nov 2025

Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs
Andrew Maranhão Ventura Dáddario
AAML · 24 Nov 2025

Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion
Yu Cui, Yifei Liu, Hang Fu, Sicheng Pan, Haibin Zhang, Cong Zuo, Licheng Wang
24 Nov 2025

Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
Junbo Zhang, Ran Chen, Qianli Zhou, Xinyang Deng, Wen Jiang
24 Nov 2025

Representational and Behavioral Stability of Truth in Large Language Models
Samantha Dies, Courtney Maynard, Germans Savcisens, Tina Eliassi-Rad
HILM · 24 Nov 2025

TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization
Yanting Wang, Runpeng Geng, Jinghui Chen, Minhao Cheng, Jinyuan Jia
23 Nov 2025

Beyond Jailbreak: Unveiling Risks in LLM Applications Arising from Blurred Capability Boundaries
Y. Zhang, Shibo Cui, Baojun Liu, Jingkai Yu, Min Zhang, Fan Shi, Han Zheng
ELM · 22 Nov 2025

The Impact of Off-Policy Training Data on Probe Generalisation
Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov
21 Nov 2025

Steering in the Shadows: Causal Amplification for Activation Space Attacks in Large Language Models
Zhiyuan Xu, Stanislav Abaimov, Joseph Gardiner, Sana Belguith
LLMSV · 21 Nov 2025

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga E. Sorokoletova, Federico Sartore, Daniele Nardi
AAML · 19 Nov 2025

When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers
Zhaoxin Zhang, Borui Chen, Yiming Hu, Youyang Qu, Tianqing Zhu, Longxiang Gao
19 Nov 2025

Entropy-Based Measurement of Value Drift and Alignment Work in Large Language Models
Samih Fadli
19 Nov 2025

N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator
Zheyu Lin, Jirui Yang, Hengqi Guo, Yubing Bao, Yao Guan
18 Nov 2025

LLM Reinforcement in Context
Thomas Rivasseau
16 Nov 2025

GRAPHTEXTACK: A Realistic Black-Box Node Injection Attack on LLM-Enhanced GNNs
Jiaji Ma, Puja Trivedi, Danai Koutra
16 Nov 2025

Generalized-Scale Object Counting with Gradual Query Aggregation
Jer Pelhan, A. Lukežič, Matej Kristan
ObjD · 11 Nov 2025

EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers
Yilin Jiang, Mingzi Zhang, Xuanyu Yin, Sheng Jin, Suyu Lu, Zuocan Ying, Zengyi Yu, Xiangjie Kong
ELM · 10 Nov 2025

KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs
Shuyuan Liu, Jiawei Chen, Xiao Yang, Hang Su, Z. Yin
AAML · 09 Nov 2025

Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity
Austin Meek, Eitan Sprejer, Iván Arcuschin, A. Brockmeier, Steven Basart
LRM · 31 Oct 2025

Prevalence of Security and Privacy Risk-Inducing Usage of AI-based Conversational Agents
Kathrin Grosse, Nico Ebert
SILM · 31 Oct 2025

Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar
LRM · 30 Oct 2025

Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token
Shaked Zychlinski, Yuval Kainan
30 Oct 2025

The Narrative Continuity Test: A Conceptual Framework for Evaluating Identity Persistence in AI Systems
Stefano Natangelo
28 Oct 2025

The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning
Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok Yan Lam
SILM · 24 Oct 2025

Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency
Yukun Jiang, Mingjie Li, Michael Backes, Yang Zhang
24 Oct 2025

Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models
Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, Zico Kolter, Andrej Risteski
AAML · 24 Oct 2025

FlexiDataGen: An Adaptive LLM Framework for Dynamic Semantic Dataset Generation in Sensitive Domains
Hamed Jelodar, Samita Bai, Roozbeh Razavi-Far, Ali Ghorbani
21 Oct 2025

Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting
Chenchen Tan, Youyang Qu, X. Li, Hui Zhang, Shujie Cui, Cunjian Chen, Longxiang Gao
MU, KELM · 20 Oct 2025

Agentic Reinforcement Learning for Search is Unsafe
Yushi Yang, Shreyansh Padarha, Andrew Lee, Adam Mahdi
LRM · 20 Oct 2025

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization
Masahiro Kaneko, Zeerak Talat, Timothy Baldwin
AAML · 19 Oct 2025

Black-box Optimization of LLM Outputs by Asking for Directions
Jie Zhang, Meng Ding, Yang Liu, Jue Hong, F. Tramèr
AAML · 19 Oct 2025

Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning
Bingqi Shang, Yiwei Chen, Yihua Zhang, Bingquan Shen, Sijia Liu
MU, KELM, AAML · 19 Oct 2025

Toward Understanding Security Issues in the Model Context Protocol Ecosystem
Xiaofan Li, Xing Gao
18 Oct 2025

SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
Hanbin Hong, Shuya Feng, Nima Naderloui, Shenao Yan, Jingyu Zhang, Biying Liu, Ali Arastehfard, Heqing Huang, Yuan Hong
AAML · 17 Oct 2025

When Flatness Does (Not) Guarantee Adversarial Robustness
Nils Philipp Walter, Linara Adilova, Jilles Vreeken, Michael Kamp
16 Oct 2025

Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems
Edoardo Allegrini, Ananth Shreekumar, Z. Berkay Celik
15 Oct 2025

RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
Tuan T. Nguyen, John Le, Thai T. Vu, Willy Susilo, Heath Cooper
14 Oct 2025

Don't Walk the Line: Boundary Guidance for Filtered Generation
Sarah Ball, Andreas Haupt
13 Oct 2025

BlackIce: A Containerized Red Teaming Toolkit for AI Security Testing
Caelin Kaplan, Alexander Warnecke, Neil Archibald
VLM · 13 Oct 2025

Merlin's Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting
Heming Xia, Cunxiao Du, Rui Li, Chak Tou Leong, Yongqi Li, Wenjie Li
LLMAG, AAML, LRM · 12 Oct 2025

ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test
Journal of Network and Computer Applications (JNCA), 2025
Guan-Yan Yang, Tzu-Yu Cheng, Ya-Wen Teng, Farn Wanga, Kuo-Hui Yeh
11 Oct 2025

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander Schulhoff, Jamie Hayes, ..., Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Seth Neel, F. Tramèr
AAML, ELM · 10 Oct 2025

A geometrical approach to solve the proximity of a point to an axisymmetric quadric in space
Bibekananda Patra, Aditya Mahesh Kolte, Sandipan Bandyopadhyay
10 Oct 2025