Universal and Transferable Adversarial Attacks on Aligned Language Models
27 July 2023
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
arXiv: 2307.15043
Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models" (50 of 938 papers shown)
An LLM can Fool Itself: A Prompt-Based Adversarial Attack
Xilie Xu
Keyi Kong
Ning Liu
Li-zhen Cui
Di Wang
Jingfeng Zhang
Mohan S. Kankanhalli
AAML
SILM
25
68
0
20 Oct 2023
Formalizing and Benchmarking Prompt Injection Attacks and Defenses
Yupei Liu
Yuqi Jia
Runpeng Geng
Jinyuan Jia
Neil Zhenqiang Gong
SILM
LLMAG
16
58
0
19 Oct 2023
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
Melanie Sclar
Yejin Choi
Yulia Tsvetkov
Alane Suhr
28
296
0
17 Oct 2023
NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails
Traian Rebedea
R. Dinu
Makesh Narsimhan Sreedhar
Christopher Parisien
Jonathan Cohen
KELM
14
132
0
16 Oct 2023
Privacy in Large Language Models: Attacks, Defenses and Future Directions
Haoran Li
Yulin Chen
Jinglong Luo
Yan Kang
Xiaojin Zhang
Qi Hu
Chunkit Chan
Yangqiu Song
PILM
38
40
0
16 Oct 2023
Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks
Shuyu Jiang
Xingshu Chen
Rui Tang
24
22
0
16 Oct 2023
Digital Deception: Generative Artificial Intelligence in Social Engineering and Phishing
Marc Schmitt
Ivan Flechais
13
34
0
15 Oct 2023
Is Certifying ℓ_p Robustness Still Worthwhile?
Ravi Mangal
Klas Leino
Zifan Wang
Kai Hu
Weicheng Yu
Corina S. Pasareanu
Anupam Datta
Matt Fredrikson
AAML
OOD
25
1
0
13 Oct 2023
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao
Alexander Robey
Edgar Dobriban
Hamed Hassani
George J. Pappas
Eric Wong
AAML
48
568
0
12 Oct 2023
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Yangsibo Huang
Samyak Gupta
Mengzhou Xia
Kai Li
Danqi Chen
AAML
22
267
0
10 Oct 2023
Multilingual Jailbreak Challenges in Large Language Models
Yue Deng
Wenxuan Zhang
Sinno Jialin Pan
Lidong Bing
AAML
29
113
0
10 Oct 2023
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei
Yifei Wang
Ang Li
Yichuan Mo
Yisen Wang
40
234
0
10 Oct 2023
The Emergence of Reproducibility and Generalizability in Diffusion Models
Huijie Zhang
Jinfan Zhou
Yifu Lu
Minzhe Guo
Peng Wang
Liyue Shen
Qing Qu
DiffM
23
2
0
08 Oct 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi
Yi Zeng
Tinghao Xie
Pin-Yu Chen
Ruoxi Jia
Prateek Mittal
Peter Henderson
SILM
44
523
0
05 Oct 2023
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey
Eric Wong
Hamed Hassani
George J. Pappas
AAML
38
216
0
05 Oct 2023
Adversarial Machine Learning for Social Good: Reframing the Adversary as an Ally
Shawqi Al-Maliki
Adnan Qayyum
Hassan Ali
M. Abdallah
Junaid Qadir
D. Hoang
Dusit Niyato
Ala I. Al-Fuqaha
AAML
26
3
0
05 Oct 2023
Misusing Tools in Large Language Models With Visual Adversarial Examples
Xiaohan Fu
Zihan Wang
Shuheng Li
Rajesh K. Gupta
Niloofar Mireshghallah
Taylor Berg-Kirkpatrick
Earlence Fernandes
AAML
21
24
0
04 Oct 2023
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Xianjun Yang
Xiao Wang
Qi Zhang
Linda R. Petzold
William Yang Wang
Xun Zhao
Dahua Lin
18
160
0
04 Oct 2023
Low-Resource Languages Jailbreak GPT-4
Zheng-Xin Yong
Cristina Menghini
Stephen H. Bach
SILM
23
169
0
03 Oct 2023
Jailbreaker in Jail: Moving Target Defense for Large Language Models
Bocheng Chen
Advait Paliwal
Qiben Yan
AAML
29
14
0
03 Oct 2023
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu
Nan Xu
Muhao Chen
Chaowei Xiao
SILM
22
257
0
03 Oct 2023
Can Language Models be Instructed to Protect Personal Information?
Yang Chen
Ethan Mendes
Sauvik Das
Wei-ping Xu
Alan Ritter
PILM
19
34
0
03 Oct 2023
Ask Again, Then Fail: Large Language Models' Vacillations in Judgment
Qiming Xie
Zengzhi Wang
Yi Feng
Rui Xia
AAML
HILM
22
9
0
03 Oct 2023
Large Language Models Cannot Self-Correct Reasoning Yet
Jie Huang
Xinyun Chen
Swaroop Mishra
Huaixiu Steven Zheng
Adams Wei Yu
Xinying Song
Denny Zhou
ReLM
LRM
17
417
0
03 Oct 2023
LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model
Muhammad Ahmed Shah
Roshan S. Sharma
Hira Dhamyal
R. Olivier
Ankit Shah
...
Massa Baali
Soham Deshmukh
Michael Kuhlmann
Bhiksha Raj
Rita Singh
AAML
25
19
0
02 Oct 2023
What's the Magic Word? A Control Theory of LLM Prompting
Aman Bhargava
Cameron Witkowski
Manav Shah
Matt W. Thomson
LLMAG
51
29
0
02 Oct 2023
On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused?
Hangfan Zhang
Zhimeng Guo
Huaisheng Zhu
Bochuan Cao
Lu Lin
Jinyuan Jia
Jinghui Chen
Di Wu
70
23
0
02 Oct 2023
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
Vaidehi Patil
Peter Hase
Mohit Bansal
KELM
AAML
20
94
0
29 Sep 2023
Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives
Elizabeth Seger
Noemi Dreksler
Richard Moulange
Emily Dardaman
Jonas Schuett
...
Emma Bluemke
Michael Aird
Patrick Levermore
Julian Hazell
Abhishek Gupta
11
40
0
29 Sep 2023
Language Models as a Service: Overview of a New Paradigm and its Challenges
Emanuele La Malfa
Aleksandar Petrov
Simon Frieder
Christoph Weinhuber
Ryan Burnell
Raza Nazar
Anthony Cohn
Nigel Shadbolt
Michael Wooldridge
ALM
ELM
30
3
0
28 Sep 2023
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Lorenzo Pacchiardi
A. J. Chan
Sören Mindermann
Ilan Moscovitz
Alexa Y. Pan
Y. Gal
Owain Evans
J. Brauner
LLMAG
HILM
17
48
0
26 Sep 2023
Large Language Model Alignment: A Survey
Tianhao Shen
Renren Jin
Yufei Huang
Chuang Liu
Weilong Dong
Zishan Guo
Xinwei Wu
Yan Liu
Deyi Xiong
LM&MA
14
176
0
26 Sep 2023
Can LLM-Generated Misinformation Be Detected?
Canyu Chen
Kai Shu
DeLMO
29
157
0
25 Sep 2023
ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning
Hosein Hasanbeig
Hiteshi Sharma
Leo Betthauser
Felipe Vieira Frujeri
Ida Momennejad
30
14
0
24 Sep 2023
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Tianle Li
Siyuan Zhuang
...
Zi Lin
Eric P. Xing
Joseph E. Gonzalez
Ion Stoica
Haotong Zhang
22
173
0
21 Sep 2023
Knowledge Sanitization of Large Language Models
Yoichi Ishibashi
Hidetoshi Shimodaira
KELM
16
19
0
21 Sep 2023
How Robust is Google's Bard to Adversarial Image Attacks?
Yinpeng Dong
Huanran Chen
Jiawei Chen
Zhengwei Fang
X. Yang
Yichi Zhang
Yu Tian
Hang Su
Jun Zhu
AAML
16
101
0
21 Sep 2023
Model Leeching: An Extraction Attack Targeting LLMs
Lewis Birch
William Hackett
Stefan Trawicki
N. Suri
Peter Garraghan
19
13
0
19 Sep 2023
LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI's ChatGPT Plugins
Umar Iqbal
Tadayoshi Kohno
Franziska Roesner
ELM
SILM
63
45
0
19 Sep 2023
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu
Xingwei Lin
Zheng Yu
Xinyu Xing
SILM
113
300
0
19 Sep 2023
Understanding Catastrophic Forgetting in Language Models via Implicit Inference
Suhas Kotha
Jacob Mitchell Springer
Aditi Raghunathan
CLL
23
58
0
18 Sep 2023
Are You Worthy of My Trust?: A Socioethical Perspective on the Impacts of Trustworthy AI Systems on the Environment and Human Society
Jamell Dacon
SILM
13
1
0
18 Sep 2023
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
Bochuan Cao
Yu Cao
Lu Lin
Jinghui Chen
AAML
23
132
0
18 Sep 2023
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
Federico Bianchi
Mirac Suzgun
Giuseppe Attanasio
Paul Röttger
Dan Jurafsky
Tatsunori Hashimoto
James Y. Zou
ALM
LM&MA
LRM
12
176
0
14 Sep 2023
RAIN: Your Language Models Can Align Themselves without Finetuning
Yuhui Li
Fangyun Wei
Jinjing Zhao
Chao Zhang
Hongyang R. Zhang
SILM
28
106
0
13 Sep 2023
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
Haoqin Tu
Bingchen Zhao
Chen Wei
Cihang Xie
MLLM
23
13
0
13 Sep 2023
Using Reed-Muller Codes for Classification with Rejection and Recovery
Daniel Fentham
David Parker
Mark Ryan
20
0
0
12 Sep 2023
Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models
Arka Dutta
Adel Khorramrouz
Sujan Dutta
Ashiqur R. KhudaBukhsh
6
0
0
08 Sep 2023
Adapting Static Fairness to Sequential Decision-Making: Bias Mitigation Strategies towards Equal Long-term Benefit Rate
Yuancheng Xu
Chenghao Deng
Yanchao Sun
Ruijie Zheng
Xiyao Wang
Jieyu Zhao
Furong Huang
27
3
0
07 Sep 2023
Framework-Based Qualitative Analysis of Free Responses of Large Language Models: Algorithmic Fidelity
A. Amirova
T. Fteropoulli
Nafiso Ahmed
Martin R. Cowie
Joel Z. Leibo
13
5
0
06 Sep 2023