Jailbroken: How Does LLM Safety Training Fail?
arXiv:2307.02483 · 5 July 2023
Alexander Wei, Nika Haghtalab, Jacob Steinhardt

Papers citing "Jailbroken: How Does LLM Safety Training Fail?" (showing 50 of 634)

Distraction is All You Need for Multimodal Large Language Model Jailbreaking
Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, Changyu Dong · AAML · 15 Feb 2025

Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks
Ang Li, Yin Zhou, Vethavikashini Chithrra Raghuram, Tom Goldstein, Micah Goldblum · AAML · 12 Feb 2025

Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment
Haoyu Wang, Zeyu Qin, Li Shen, Xueqian Wang, Minhao Cheng, Dacheng Tao · 06 Feb 2025

KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
Buyun Liang, Kwan Ho Ryan Chan, D. Thaker, Jinqi Luo, René Vidal · AAML · 05 Feb 2025

Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Javier Rando, Jie Zhang, Nicholas Carlini, F. Tramèr · AAML, ELM · 04 Feb 2025

Topic-FlipRAG: Topic-Orientated Adversarial Opinion Manipulation Attacks to Retrieval-Augmented Generation Models
Y. Gong, Zhuo Chen, Miaokun Chen, Fengchang Yu, Wei-Tsung Lu, XiaoFeng Wang, Xiaozhong Liu, J. Liu · AAML, SILM · 03 Feb 2025

Breaking Focus: Contextual Distraction Curse in Large Language Models
Yue Huang, Yanbo Wang, Zixiang Xu, Chujie Gao, Siyuan Wu, Jiayi Ye, Xiuying Chen, Pin-Yu Chen, X. Zhang · AAML · 03 Feb 2025

"I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models
Isha Gupta
David Khachaturov
Robert D. Mullins
AAML
AuLLM
60
1
0
02 Feb 2025
Trading Inference-Time Compute for Adversarial Robustness
Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, ..., Rachel Dias, Eric Wallace, Kai Y. Xiao, Johannes Heidecke, Amelia Glaese · LRM, AAML · 31 Jan 2025

The Pitfalls of "Security by Obscurity" And What They Mean for Transparent AI
Peter Hall, Olivia Mundahl, Sunoo Park · 30 Jan 2025

When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang · 28 Jan 2025

HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor
Zihui Wu, Haichang Gao, Jiacheng Luo, Zhaoxiang Liu · 23 Jan 2025

Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment
Melissa Kazemi Rad, Huy Nghiem, Andy Luo, Sahil Wadhwa, Mohammad Sorower, Stephen Rawls · AAML · 22 Jan 2025

Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity
David Williams-King, Linh Le, Adam Oberman, Yoshua Bengio · AAML · 19 Jan 2025

Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
Jonathan Nöther, Adish Singla, Goran Radanović · AAML · 14 Jan 2025

Lessons From Red Teaming 100 Generative AI Products
Blake Bullwinkel, Amanda Minnich, Shiven Chawla, Gary Lopez, Martin Pouliot, ..., Pete Bryan, Ram Shankar Siva Kumar, Yonatan Zunger, Chang Kawaguchi, Mark Russinovich · AAML, VLM · 13 Jan 2025

Safeguarding System Prompts for LLMs
Zhifeng Jiang, Zhihua Jin, Guoliang He · AAML, SILM · 10 Jan 2025

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran · SILM · 08 Jan 2025

CALM: Curiosity-Driven Auditing for Large Language Models
Xiang Zheng, Longxiang Wang, Yi Liu, Xingjun Ma, Chao Shen, Cong Wang · MLAU · 06 Jan 2025

Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense
Yang Ouyang, Hengrui Gu, Shuhang Lin, Wenyue Hua, Jie Peng, B. Kailkhura, Tianlong Chen, Kaixiong Zhou · AAML · 05 Jan 2025

Security Attacks on LLM-based Code Completion Tools
Wen Cheng, Ke Sun, Xinyu Zhang, Wei Wang · SILM, AAML, ELM · 03 Jan 2025

LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
Miao Yu, Junfeng Fang, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, Qingsong Wen · AAML · 03 Jan 2025

DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak
Hao Wang, Hao Li, Junda Zhu, Xinyuan Wang, C. Pan, Minlie Huang, Lei Sha · 23 Dec 2024

Sim911: Towards Effective and Equitable 9-1-1 Dispatcher Training with an LLM-Enabled Simulation
Zirong Chen, Elizabeth Chason, Noah Mladenovski, Erin Wilson, Kristin Mullen, Stephen Martini, Meiyi Ma · 22 Dec 2024

Beyond Partisan Leaning: A Comparative Analysis of Political Bias in Large Language Models
Kaiqi Yang, Hang Li, Yucheng Chu, Hang Li, Tai-Quan Peng, Yuping Lin, Hui Liu · 21 Dec 2024

Jailbreaking? One Step Is Enough!
Weixiong Zheng, Peijian Zeng, Y. Li, Hongyan Wu, Nankai Lin, J. Chen, Aimin Yang, Y. Zhou · AAML · 17 Dec 2024

Towards Action Hijacking of Large Language Model-based Agent
Yuyang Zhang, Kangjie Chen, Xudong Jiang, Yuxiang Sun, Run Wang, Lina Wang · LLMAG, AAML · 14 Dec 2024

No Free Lunch for Defending Against Prefilling Attack by In-Context Learning
Zhiyu Xue, Guangliang Liu, Bocheng Chen, K. Johnson, Ramtin Pedarsani · AAML · 13 Dec 2024

FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks
Bocheng Chen, Hanqing Guo, Qiben Yan · AAML · 10 Dec 2024

Improved Large Language Model Jailbreak Detection via Pretrained Embeddings
Erick Galinkin, Martin Sablotny · 02 Dec 2024

Yi-Lightning Technical Report
01.AI: Alan Wake, Albert Wang, Bei Chen, ..., Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, Zonghong Dai · OSLM · 02 Dec 2024

Quantized Delta Weight Is Safety Keeper
Yule Liu, Zhen Sun, Xinlei He, Xinyi Huang · 29 Nov 2024

Examining Multimodal Gender and Content Bias in ChatGPT-4o
Roberto Balestri · 28 Nov 2024

Don't Let Your Robot be Harmful: Responsible Robotic Manipulation
Minheng Ni, Lei Zhang, Z. Chen, L. Zhang, Wangmeng Zuo · 27 Nov 2024

Privacy-Preserving Behaviour of Chatbot Users: Steering Through Trust Dynamics
Julia Ive, Vishal Yadav, Mariia Ignashina, Matthew Rand, Paulina Bondaronek · 26 Nov 2024

Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective
Jean Marie Tshimula, Xavier Ndona, D'Jeff K. Nkashama, Pierre Martin Tardif, F. Kabanza, Marc Frappier, Shengrui Wang · SILM · 25 Nov 2024

ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain
Haochen Zhao, Xiangru Tang, Ziran Yang, Xiao Han, Xuanzhi Feng, ..., Senhao Cheng, Di Jin, Yilun Zhao, Arman Cohan, Mark B. Gerstein · ELM · 23 Nov 2024

Global Challenge for Safe and Secure LLMs Track 1
Xiaojun Jia, Yihao Huang, Yang Liu, Peng Yan Tan, Weng Kuan Yau, ..., Yan Wang, Rick Siow Mong Goh, Liangli Zhen, Yingjie Zhang, Zhe Zhao · ELM, AILaw · 21 Nov 2024

Steering Language Model Refusal with Sparse Autoencoders
Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangdeh · LLMSV · 18 Nov 2024

JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen · 17 Nov 2024

Developer Perspectives on Licensing and Copyright Issues Arising from Generative AI for Software Development
Trevor Stalnaker, Nathan Wintersgill, Oscar Chaparro, Laura A. Heymann, M. D. Penta, Daniel M. German, Denys Poshyvanyk · 16 Nov 2024

Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Michael Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, Mahesh Pasupuleti · MLLM, 3DH · 15 Nov 2024

Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey
Xuannan Liu, Xing Cui, Peipei Li, Zekun Li, Huaibo Huang, Shuhan Xia, Miaoxuan Zhang, Yueying Zou, Ran He · AAML · 14 Nov 2024

DROJ: A Prompt-Driven Attack against Large Language Models
Leyang Hu, Boran Wang · 14 Nov 2024

New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook
Meng Yang, Tianqing Zhu, Chi Liu, Wanlei Zhou, Shui Yu, Philip S. Yu · AAML, ELM, PILM · 12 Nov 2024

Artificial Intelligence for Biomedical Video Generation
Linyuan Li, Jianing Qiu, Anujit Saha, Lin Li, Poyuan Li, Mengxian He, Ziyu Guo, Wu Yuan · VGen · 12 Nov 2024

Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
Khaoula Chehbouni, Jonathan Colaço-Carr, Yash More, Jackie CK Cheung, G. Farnadi · 12 Nov 2024

SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs
Ruben Härle, Felix Friedrich, Manuel Brack, Bjorn Deiseroth, P. Schramowski, Kristian Kersting · 11 Nov 2024

"I Always Felt that Something Was Wrong.": Understanding Compliance Risks and Mitigation Strategies when Professionals Use Large Language Models
Siying Hu
Piaohong Wang
Yaxing Yao
Zhicong Lu
AILaw
PILM
37
0
0
07 Nov 2024
Diversity Helps Jailbreak Large Language Models
Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, Chengzhi Mao · AAML · 06 Nov 2024