Universal and Transferable Adversarial Attacks on Aligned Language Models (arXiv:2307.15043)
27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models" (50 of 938 shown):
LLM as OS, Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem · Yingqiang Ge, Yujie Ren, Wenyue Hua, Shuyuan Xu, Juntao Tan, Yongfeng Zhang · [LLMAG] · 06 Dec 2023
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks · Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, Ser-Nam Lim · [AAML, VLM] · 06 Dec 2023
Scaling Laws for Adversarial Attacks on Language Model Activations · Stanislav Fort · 05 Dec 2023
Prompt Optimization via Adversarial In-Context Learning · Do Xuan Long, Yiran Zhao, Hannah Brown, Yuxi Xie, James Xu Zhao, Nancy F. Chen, Kenji Kawaguchi, Michael Qizhe Xie, Junxian He · 05 Dec 2023
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically · Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi · 04 Dec 2023
A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly · Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Eric Sun, Yue Zhang · [PILM, ELM] · 04 Dec 2023
Distilled Self-Critique of LLMs with Synthetic Data: a Bayesian Perspective · Víctor Gallego · 04 Dec 2023
VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models · Xiang Li, Qianli Shen, Kenji Kawaguchi · 29 Nov 2023
MMA-Diffusion: MultiModal Attack on Diffusion Models · Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, Qiang Xu · 29 Nov 2023
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs · Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie · [MLLM] · 27 Nov 2023
Universal Jailbreak Backdoors from Poisoned Human Feedback · Javier Rando, Florian Tramèr · 24 Nov 2023
Transfer Attacks and Defenses for Large Language Models on Coding Tasks · Chi Zhang, Zifan Wang, Ravi Mangal, Matt Fredrikson, Limin Jia, Corina S. Pasareanu · [AAML, SILM] · 22 Nov 2023
Evil Geniuses: Delving into the Safety of LLM-based Agents · Yu Tian, Xiao Yang, Jingyuan Zhang, Yinpeng Dong, Hang Su · [LLMAG, AAML] · 20 Nov 2023
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information · Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng-Chiao Huang, Vishy Swaminathan · 20 Nov 2023
Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems · Guangjing Wang, Ce Zhou, Yuanda Wang, Bocheng Chen, Hanqing Guo, Qiben Yan · [AAML, SILM] · 20 Nov 2023
Hijacking Large Language Models via Adversarial In-Context Learning · Yao Qiang, Xiangyu Zhou, Dongxiao Zhu · 16 Nov 2023
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking · Nan Xu, Fei Wang, Ben Zhou, Bangzheng Li, Chaowei Xiao, Muhao Chen · 16 Nov 2023
Automatic Engineering of Long Prompts · Cho-Jui Hsieh, Si Si, Felix X. Yu, Inderjit S. Dhillon · [VLM] · 16 Nov 2023
Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework · Matthew Pisano, Peter Ly, Abraham Sanders, Bingsheng Yao, Dakuo Wang, T. Strzalkowski, Mei Si · [AAML] · 16 Nov 2023
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections · Yuanpu Cao, Bochuan Cao, Jinghui Chen · 15 Nov 2023
How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities · Lingbo Mo, Boshi Wang, Muhao Chen, Huan Sun · 15 Nov 2023
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts · Yuanwei Wu, Xiang Li, Yixin Liu, Pan Zhou, Lichao Sun · 15 Nov 2023
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization · Zhexin Zhang, Junxiao Yang, Pei Ke, Fei Mi, Hongning Wang, Minlie Huang · [AAML] · 15 Nov 2023
Alignment is not sufficient to prevent large language models from generating harmful information: A psychoanalytic perspective · Zi Yin, Wei Ding, Jia Liu · 14 Nov 2023
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily · Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang · [AAML] · 14 Nov 2023
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming · Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao · [AAML, LRM] · 13 Nov 2023
Prompts have evil twins · Rimon Melamed, Lucas H. McCabe, T. Wakhare, Yejin Kim, H. H. Huang, Enric Boix-Adsera · 13 Nov 2023
In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering · Sheng Liu, Haotian Ye, Lei Xing, James Y. Zou · 11 Nov 2023
Intentional Biases in LLM Responses · Nicklaus Badyal, Derek Jacoby, Yvonne Coady · 11 Nov 2023
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts · Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang · [MLLM] · 09 Nov 2023
Conversational AI Threads for Visualizing Multidimensional Datasets · Matt-Heun Hong, Anamaria Crisan · 09 Nov 2023
Removing RLHF Protections in GPT-4 via Fine-Tuning · Qiusi Zhan, Richard Fang, R. Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang · [MU, AAML] · 09 Nov 2023
Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?" · C. D. Freeman, Laura J. Culp, Aaron T Parisi, Maxwell Bileschi, Gamaleldin F. Elsayed, ..., Peter J. Liu, Roman Novak, Yundi Qian, Noah Fiedel, Jascha Narain Sohl-Dickstein · [AAML] · 08 Nov 2023
Do LLMs exhibit human-like response biases? A case study in survey design · Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig · 07 Nov 2023
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation · Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando · 06 Nov 2023
DeepInception: Hypnotize Large Language Model to Be Jailbreaker · Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han · 06 Nov 2023
Can LLMs Follow Simple Rules? · Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David A. Wagner · [ALM] · 06 Nov 2023
Market Concentration Implications of Foundation Models · Jai Vipra, Anton Korinek · [ELM] · 02 Nov 2023
Implicit Chain of Thought Reasoning via Knowledge Distillation · Yuntian Deng, Kiran Prasad, Roland Fernandez, P. Smolensky, Vishrav Chaudhary, Stuart M. Shieber · [ReLM, LRM] · 02 Nov 2023
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game · Sam Toyer, Olivia Watkins, Ethan Mendes, Justin Svegliato, Luke Bailey, ..., Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart J. Russell · 02 Nov 2023
Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield · Jinhwa Kim, Ali Derakhshan, Ian G. Harris · [AAML] · 31 Oct 2023
BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B · Pranav M. Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish · [ALM, AI4MH] · 31 Oct 2023
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B · Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish · [ALM] · 31 Oct 2023
Adversarial Attacks and Defenses in Large Language Models: Old and New Threats · Leo Schwinn, David Dobre, Stephan Günnemann, Gauthier Gidel · [AAML, ELM] · 30 Oct 2023
When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations · Aleksandar Petrov, Philip H. S. Torr, Adel Bibi · [VPVLM] · 30 Oct 2023
BERT Lost Patience Won't Be Robust to Adversarial Slowdown · Zachary Coalson, Gabriel Ritter, Rakesh Bobba, Sanghyun Hong · [AAML] · 29 Oct 2023
AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors · You-Ming Chang, Chen Yeh, Wei-Chen Chiu, Ning Yu · [VPVLM, VLM] · 26 Oct 2023
Self-Guard: Empower the LLM to Safeguard Itself · Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, Kam-Fai Wong · 24 Oct 2023
Unnatural language processing: How do language models handle machine-generated prompts? · Corentin Kervadec, Francesca Franzon, Marco Baroni · 24 Oct 2023
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models · Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, A. Nenkova, Tong Sun · [SILM, AAML] · 23 Oct 2023