Open Sesame! Universal Black Box Jailbreaking of Large Language Models
arXiv:2309.01446 · Applied Sciences (Appl. Sci.), 2023 · 4 September 2023
Raz Lapid, Ron Langberg, Moshe Sipper · AAML
Papers citing "Open Sesame! Universal Black Box Jailbreaking of Large Language Models" (44 of 94 shown)
Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries
Jiahao Yu, Haozheng Luo, Jerry Yao-Chieh Hu, Wenbo Guo, Han Liu, Xinyu Xing · 31 May 2024
A Theoretical Understanding of Self-Correction through In-context Alignment
Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang · LRM · 28 May 2024
PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition
International Conference on Machine Learning (ICML), 2024
Ziyang Zhang, Qizhen Zhang, Jakob N. Foerster · AAML · 13 May 2024
Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
Computer/Law Journal (JITPL), 2024
Shang Shang, Xinqiang Zhao, Zhongjiang Yao, Yepeng Yao, Liya Su, Zijing Fan, Xiaodan Zhang, Zhengwei Jiang · 06 May 2024
When LLMs Meet Cybersecurity: A Systematic Literature Review
Jie Zhang, Haoyu Bu, Hui Wen, Yu Chen, Lun Li, Hongsong Zhu · 06 May 2024
Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs
Valeriia Cherepanova, James Zou · AAML · 26 Apr 2024
Fortify the Guardian, Not the Treasure: Resilient Adversarial Detectors
Raz Lapid, Almog Dubin, Moshe Sipper · AAML · 18 Apr 2024
Ethical Framework for Responsible Foundational Models in Medical Imaging
Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, A. Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci · MedIm · 14 Apr 2024
Exploring the True Potential: Evaluating the Black-box Optimization Capability of Large Language Models
Beichen Huang, Xingyu Wu, Yu Zhou, Jibin Wu, Liang Feng, Ran Cheng, Kay Chen Tan · 09 Apr 2024
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
International Conference on Learning Representations (ICLR), 2024
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion · AAML · 02 Apr 2024
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Patrick Chao, Edoardo Debenedetti, Avi Schwarzschild, Maksym Andriushchenko, Francesco Croce, ..., Nicolas Flammarion, George J. Pappas, F. Tramèr, Hamed Hassani, Eric Wong · ALM, ELM, AAML · 28 Mar 2024
Optimization-based Prompt Injection Attack to LLM-as-a-Judge
Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong · AAML · 26 Mar 2024
XAI-Based Detection of Adversarial Attacks on Deepfake Detectors
Ben Pinhasov, Raz Lapid, Rony Ohayon, Moshe Sipper, Y. Aperstein · AAML · 05 Mar 2024
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, ..., Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu · SyDa · 26 Feb 2024
Multi-Bit Distortion-Free Watermarking for Large Language Models
Massieh Kordi Boroujeny, Ya Jiang, Kai Zeng, Brian L. Mark · WaLM, VLM · 26 Feb 2024
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
Jiabao Ji, Bairu Hou, Avi Schwarzschild, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang · AAML · 25 Feb 2024
Fast Adversarial Attacks on Language Models In One GPU Minute
Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Malemir Chegini, Soheil Feizi · MIALM · 23 Feb 2024
Coercing LLMs to do and reveal (almost) anything
Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein · AAML · 21 Feb 2024
Is the System Message Really Important to Jailbreaks in Large Language Models?
Xiaotian Zou, Yongkang Chen, Ke Li · 20 Feb 2024
A StrongREJECT for Empty Jailbreaks
Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, ..., Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer · 15 Feb 2024
PAL: Proxy-Guided Black-Box Attack on Large Language Models
Chawin Sitawarin, Norman Mu, David Wagner, Alexandre Araujo · ELM · 15 Feb 2024
Attacking Large Language Models with Projected Gradient Descent
Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann · AAML, SILM · 14 Feb 2024
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Günnemann · AAML · 14 Feb 2024
COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
Xing-ming Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu · AAML · 13 Feb 2024
Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin · LLMAG, LM&Ro · 13 Feb 2024
Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, Yang Liu · 13 Feb 2024
Security and Privacy Challenges of Large Language Models: A Survey
B. Das, M. H. Amini, Yanzhao Wu · PILM, ELM · 30 Jan 2024
Weak-to-Strong Jailbreaking on Large Language Models
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Y. Wang · 30 Jan 2024
Red-Teaming for Generative AI: Silver Bullet or Security Theater?
AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2024
Michael Feffer, Anusha Sinha, Wesley Hanwen Deng, Zachary Chase Lipton, Hoda Heidari · AAML · 29 Jan 2024
Black-Box Access is Insufficient for Rigorous AI Audits
Conference on Fairness, Accountability and Transparency (FAccT), 2024
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell · AAML · 25 Jan 2024
When Large Language Models Meet Evolutionary Algorithms: Potential Enhancements and Challenges
Wang Chao, Jiaxuan Zhao, Licheng Jiao, Lingling Li, Fang Liu, Shuyuan Yang · 19 Jan 2024
Evolutionary Computation in the Era of Large Language Model: Survey and Roadmap
Xingyu Wu, Sheng-hao Wu, Jibin Wu, Liang Feng, Kay Chen Tan · ELM · 18 Jan 2024
All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
Kazuhiro Takemoto · 18 Jan 2024
Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
Zhuo Zhang, Guangyu Shen, Guanhong Tao, Shuyang Cheng, Xiangyu Zhang · 08 Dec 2023
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Neural Information Processing Systems (NeurIPS), 2023
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi · 04 Dec 2023
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
AAAI Conference on Artificial Intelligence (AAAI), 2023
Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang · MLLM · 09 Nov 2023
Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
Leo Schwinn, David Dobre, Stephan Günnemann, Gauthier Gidel · AAML, ELM · 30 Oct 2023
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, A. Nenkova, Tong Sun · SILM, AAML · 23 Oct 2023
Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks
Shuyu Jiang, Xingshu Chen, Rui Tang · 16 Oct 2023
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
International Conference on Learning Representations (ICLR), 2023
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen · AAML · 10 Oct 2023
Low-Resource Languages Jailbreak GPT-4
Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach · SILM · 03 Oct 2023
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
International Conference on Learning Representations (ICLR), 2023
Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao · SILM · 03 Oct 2023
Can LLM-Generated Misinformation Be Detected?
International Conference on Learning Representations (ICLR), 2023
Canyu Chen, Kai Shu · DeLMO · 25 Sep 2023
Patch of Invisibility: Naturalistic Physical Black-Box Adversarial Attacks on Object Detectors
Raz Lapid, Eylon Mizrahi, Moshe Sipper · AAML · 07 Mar 2023