Open Sesame! Universal Black Box Jailbreaking of Large Language Models (arXiv:2309.01446)
Applied Sciences (Appl. Sci.), 2023
Raz Lapid, Ron Langberg, Moshe Sipper. 4 September 2023. [AAML]
Papers citing "Open Sesame! Universal Black Box Jailbreaking of Large Language Models" (50 of 94 papers shown)
TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization
Yanting Wang, Runpeng Geng, Jinghui Chen, Minhao Cheng, Jinyuan Jia. 23 Nov 2025.

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma. 16 Nov 2025. [AAML]

Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
Hamin Koo, Minseon Kim, Jaehyung Kim. 03 Nov 2025.

Diffusion LLMs are Natural Adversaries for any LLM
David Lüdke, Tom Wollschlager, Paul Ungermann, Stephan Günnemann, Leo Schwinn. 31 Oct 2025. [DiffM]

SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
Hanbin Hong, Shuya Feng, Nima Naderloui, Shenao Yan, Jingyu Zhang, Biying Liu, Ali Arastehfard, Heqing Huang, Yuan Hong. 17 Oct 2025. [AAML]
LLM Jailbreak Detection for (Almost) Free!
Guorui Chen, Yifan Xia, Xiaojun Jia, Ruoyao Xiao, Juil Sock, Jindong Gu. 18 Sep 2025.

Semantic Representation Attack against Aligned Large Language Models
Jiawei Lian, Jianhong Pan, L. Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau. 18 Sep 2025.

HAMSA: Hijacking Aligned Compact Models via Stealthy Automation
Alexey Krylov, Iskander Vagizov, Dmitrii Korzh, Maryam Douiba, Azidine Guezzaz, Vladimir Kokh, Sergey D. Erokhin, Elena Tutubalina, Oleg Y. Rogov. 22 Aug 2025.

SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks
Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, Rongxing Lu. 21 Aug 2025. [AAML, MU, KELM]

Too Easily Fooled? Prompt Injection Breaks LLMs on Frustratingly Simple Multiple-Choice Questions
Xuyang Guo, Zekai Huang, Zhao Song, Jiahao Zhang. 16 Aug 2025. [LRM]
Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li. 14 Aug 2025. [AAML, SILM]

GPT, But Backwards: Exactly Inverting Language Model Outputs
Adrians Skapars, Edoardo Manino, Youcheng Sun, Lucas C. Cordeiro. 02 Jul 2025.

VERA: Variational Inference Framework for Jailbreaking Large Language Models
Anamika Lochab, Lu Yan, Patrick Pynadath, Xiangyu Zhang, Ruqi Zhang. 27 Jun 2025. [AAML, VLM]

MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models
Mingyu Yu, Wei Wang, Y. X. Wei, Sujuan Qin, Fei Gao, Wenmin Li. 29 May 2025. [AAML]

Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval
Taiye Chen, Zeming Wei, Ang Li, Yisen Wang. 21 May 2025. [AAML]
BadNAVer: Exploring Jailbreak Attacks On Vision-and-Language Navigation
Wenqi Lyu, Zerui Li, Yanyuan Qiao, Qi Wu. 18 May 2025. [AAML]

Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey, David Evans. 23 Apr 2025. [LLMSV]

A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models
Carlos Peláez-González, Andrés Herrera-Poyatos, Cristina Zuheros, David Herrera-Poyatos, Virilo Tejedor, F. Herrera. 07 Apr 2025. [AAML]

Don't Lag, RAG: Training-Free Adversarial Detection Using RAG
Roie Kazoom, Raz Lapid, Moshe Sipper, Ofer Hadar. 07 Apr 2025. [VLM, ObjD, AAML]

Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
Computer Vision and Pattern Recognition (CVPR), 2025
Joonhyun Jeong, Seyun Bae, Yeonsung Jung, Jaeryong Hwang, Eunho Yang. 26 Mar 2025. [AAML]
AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration
Andy Zhou, Kevin E. Wu, Francesco Pinto, Zhongfu Chen, Yi Zeng, Yu Yang, Shuang Yang, Sanmi Koyejo, James Zou, Bo Li. 20 Mar 2025. [LLMAG, AAML]

AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations
Dillon Bowen, Ann-Kathrin Dombrowski, Adam Gleave, Chris Cundy. 17 Mar 2025. [ELM]

JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2025
Shuyi Liu, Simiao Cui, Haoran Bu, Yuming Shang, Xi Zhang. 26 Feb 2025. [ELM]

KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
Buyun Liang, Kwan Ho Ryan Chan, D. Thaker, Jinqi Luo, René Vidal. 05 Feb 2025. [AAML]

When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Neural Information Processing Systems (NeurIPS), 2024
Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang. 28 Jan 2025.
Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
AAAI Conference on Artificial Intelligence (AAAI), 2025
Jonathan Nöther, Adish Singla, Goran Radanović. 14 Jan 2025. [AAML]

Global Challenge for Safe and Secure LLMs Track 1
Yang Liu, Yihao Huang, Yang Liu, Peng Yan Tan, Weng Kuan Yau, ..., Yan Wang, Rick Siow Mong Goh, Liangli Zhen, Yingjie Zhang, Zhe Zhao. 21 Nov 2024. [ELM, AILaw]

DROJ: A Prompt-Driven Attack against Large Language Models
Leyang Hu, Boran Wang. 14 Nov 2024.

Diversity Helps Jailbreak Large Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, Chengzhi Mao. 06 Nov 2024. [AAML]

AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation
Zijun Wang, Haoqin Tu, J. Mei, Bingchen Zhao, Yanjie Wang, Cihang Xie. 11 Oct 2024.
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
International Conference on Learning Representations (ICLR), 2024
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin. 09 Oct 2024.

FlipAttack: Jailbreak LLMs via Flipping
Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, Bryan Hooi. 02 Oct 2024. [AAML]

AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Lijia Lv, Weigang Zhang, Xuehai Tang, Jie Wen, Feng Liu, Jizhong Han, Songlin Hu. 11 Sep 2024. [AAML]

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang. 05 Sep 2024. [PILM, AAML]

Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An, Sicheng Zhu, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Furong Huang. 01 Sep 2024. [AAML]
On the Robustness of Kolmogorov-Arnold Networks: An Adversarial Perspective
Tal Alter, Raz Lapid, Moshe Sipper. 25 Aug 2024. [AAML]

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Neural Information Processing Systems (NeurIPS), 2024
Jingtong Su, Mingyu Lee, SangKeun Lee. 02 Aug 2024.

RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
Huiyu Xu, Wenhui Zhang, Peng Kuang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, Kui Ren. 23 Jul 2024. [AAML, LLMAG]

Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion. 16 Jul 2024.

Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation
Riccardo Cantini, Giada Cosenza, A. Orsino, Domenico Talia. 11 Jul 2024. [AAML]
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li. 05 Jul 2024. [AAML]

Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything
Xiaotian Zou, Ke Li, Yongkang Chen. 01 Jul 2024. [MLLM]

JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang. 26 Jun 2024. [PILM]

Large Language Models as Surrogate Models in Evolutionary Algorithms: A Preliminary Study
Swarm and Evolutionary Computation (Swarm Evol. Comput.), 2024
Hao Hao, Xiaoqun Zhang, Aimin Zhou. 15 Jun 2024. [ELM]

JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang. 13 Jun 2024. [ELM]
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou. 06 Jun 2024. [AAML]

Safeguarding Large Language Models: A Survey
Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, ..., Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang. 03 Jun 2024. [OffRL, KELM, AILaw]

Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin. 03 Jun 2024. [AAML]

Exploring Vulnerabilities and Protections in Large Language Models: A Survey
Frank Weizhen Liu, Chenhui Hu. 01 Jun 2024. [AAML]

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Yang Liu, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Simeng Qin, Min Lin. 31 May 2024. [AAML]