Universal and Transferable Adversarial Attacks on Aligned Language Models
arXiv:2307.15043 · 27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models" (50 of 938 shown)

AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou [AAML] · 06 Jun 2024

Ranking Manipulation for Conversational Search Engines
Samuel Pfrommer, Yatong Bai, Tanmay Gautam, Somayeh Sojoudi [SILM] · 05 Jun 2024

HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits
Tim Franzmeyer, Aleksandar Shtedritski, Samuel Albanie, Philip H. S. Torr, João F. Henriques, Jakob N. Foerster · 05 Jun 2024

Defending Large Language Models Against Attacks With Residual Stream Activation Analysis
Amelia Kawasaki, Andrew Davis, Houssam Abbas [AAML, KELM] · 05 Jun 2024

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Difan Zou, Yisong Yue, Ziniu Hu · 04 Jun 2024

On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept
Guangliang Liu, Haitao Mao, Bochuan Cao, Zhiyu Xue, K. Johnson, Jiliang Tang, Rongrong Wang [LRM] · 04 Jun 2024

Dishonesty in Helpful and Harmless Alignment
Youcheng Huang, Jingkun Tang, Duanyu Feng, Zheng-Wei Zhang, Wenqiang Lei, Jiancheng Lv, Anthony G. Cohn [LLMSV] · 04 Jun 2024

Safeguarding Large Language Models: A Survey
Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, ..., Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang [OffRL, KELM, AILaw] · 03 Jun 2024

Decoupled Alignment for Robust Plug-and-Play Adaptation
Haozheng Luo, Jiahao Yu, Wenxin Zhang, Jialong Li, Jerry Yao-Chieh Hu, Xingyu Xing, Han Liu · 03 Jun 2024

BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Diego Dorn, Alexandre Variengien, Charbel-Raphaël Ségerie, Vincent Corruble · 03 Jun 2024

Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min-Bin Lin [AAML] · 03 Jun 2024

BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models
Jiaqi Xue, Meng Zheng, Yebowen Hu, Fei Liu, Xun Chen, Qian Lou [AAML, SILM] · 03 Jun 2024

Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
Andis Draguns, Andrew Gritsevskiy, S. Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt · 03 Jun 2024

Inverse Constitutional AI: Compressing Preferences into Principles
Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, Robert Mullins [SyDa] · 02 Jun 2024

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities
Richard Fang, Antony Kellermann, Akul Gupta, Qiusi Zhan, R. Bindu, Daniel Kang [LLMAG] · 02 Jun 2024

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min-Bin Lin [AAML] · 31 May 2024

Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training
Feiteng Fang, Yuelin Bai, Shiwen Ni, Min Yang, Xiaojun Chen, Ruifeng Xu [AAML, RALM] · 31 May 2024

OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh [ALM] · 31 May 2024

Preemptive Answer "Attacks" on Chain-of-Thought Reasoning
Rongwu Xu, Zehan Qi, Wei Xu [LRM, SILM] · 31 May 2024

Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
Jiahao Yu, Haozheng Luo, Jerry Yao-Chieh Hu, Wenbo Guo, Han Liu, Xinyu Xing · 31 May 2024

Phantom: General Trigger Attacks on Retrieval Augmented Language Generation
Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A. Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, Alina Oprea [SILM, AAML] · 30 May 2024

Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Haibo Jin, Andy Zhou, Joe D. Menke, Haohan Wang · 30 May 2024

Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, Daniel E. Ho [HILM, ELM, AILaw] · 30 May 2024

Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
Chen Xiong, Xiangyu Qi, Pin-Yu Chen, Tsung-Yi Ho [AAML] · 30 May 2024

Efficient LLM-Jailbreaking by Introducing Visual Modality
Zhenxing Niu, Yuyao Sun, Haodong Ren, Haoxuan Ji, Quan Wang, Xiaoke Ma, Gang Hua, Rong Jin · 30 May 2024

AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization
Jiawei Chen, Xiao Yang, Zhengwei Fang, Yu Tian, Yinpeng Dong, Zhaoxia Yin, Hang Su · 30 May 2024

AI Risk Management Should Incorporate Both Safety and Security
Xiangyu Qi, Yangsibo Huang, Yi Zeng, Edoardo Debenedetti, Jonas Geiping, ..., Chaowei Xiao, Bo-wen Li, Dawn Song, Peter Henderson, Prateek Mittal [AAML] · 29 May 2024

Toxicity Detection for Free
Zhanhao Hu, Julien Piet, Geng Zhao, Jiantao Jiao, David A. Wagner · 29 May 2024

A Theoretical Understanding of Self-Correction through In-context Alignment
Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang [LRM] · 28 May 2024

LLMs and Memorization: On Quality and Specificity of Copyright Compliance
Felix Müller, Rebekka Görge, Anna K. Bernzen, Janna C Pirk, Maximilian Poretschkin · 28 May 2024

Exploiting LLM Quantization
Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, Martin Vechev [MQ] · 28 May 2024

White-box Multimodal Jailbreaks Against Large Vision-Language Models
Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang [AAML, VLM] · 28 May 2024

Improved Generation of Adversarial Examples Against Safety-aligned LLMs
Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen [AAML, SILM] · 28 May 2024

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, Jinghui Chen [LLMSV] · 28 May 2024

Learning diverse attacks on large language models for robust red-teaming and safety tuning
Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, ..., Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain [AAML] · 28 May 2024

Cross-Modal Safety Alignment: Is textual unlearning all you need?
Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael B. Abu-Ghazaleh, M. Salman Asif, Yue Dong, A. Roy-Chowdhury, Chengyu Song · 27 May 2024

Privacy-Aware Visual Language Models
Laurens Samson, Nimrod Barazani, S. Ghebreab, Yukiyasu Asano [PILM, VLM] · 27 May 2024

ReMoDetect: Reward Models Recognize Aligned LLM's Generations
Hyunseok Lee, Jihoon Tack, Jinwoo Shin [DeLMO] · 27 May 2024

Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
Sheng-Hsuan Peng, Pin-Yu Chen, Matthew Hull, Duen Horng Chau · 27 May 2024

Exploiting the Layered Intrinsic Dimensionality of Deep Models for Practical Adversarial Training
Enes Altinisik, Safa Messaoud, H. Sencar, Hassan Sajjad, Sanjay Chawla [AAML] · 27 May 2024

Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, Chun-ying Huang · 27 May 2024

The Uncanny Valley: Exploring Adversarial Robustness from a Flatness Perspective
Nils Philipp Walter, Linara Adilova, Jilles Vreeken, Michael Kamp [AAML] · 27 May 2024

Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character
Siyuan Ma, Weidi Luo, Yu Wang, Xiaogeng Liu · 25 May 2024

No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, Wenjie Li [AAML] · 25 May 2024

Efficient Adversarial Training in LLMs with Continuous Attacks
Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn [AAML] · 24 May 2024

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
Guanlin Li, Kangjie Chen, Shudong Zhang, Jie M. Zhang, Tianwei Zhang [EGVM] · 24 May 2024

Robustifying Safety-Aligned Large Language Models through Clean Data Curation
Xiaoqun Liu, Jiacheng Liang, Muchao Ye, Zhaohan Xi [AAML] · 24 May 2024

Extracting Prompts by Inverting LLM Outputs
Collin Zhang, John X. Morris, Vitaly Shmatikov · 23 May 2024

Semantic-guided Prompt Organization for Universal Goal Hijacking against LLMs
Yihao Huang, Chong Wang, Xiaojun Jia, Qing-Wu Guo, Felix Juefei Xu, Jian Zhang, G. Pu, Yang Liu · 23 May 2024

ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation
Jingnan Zheng, Han Wang, An Zhang, Tai D. Nguyen, Jun Sun, Tat-Seng Chua [LLMAG] · 23 May 2024