arXiv: 2307.15043
Universal and Transferable Adversarial Attacks on Aligned Language Models
27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models" (50 of 938 papers shown)

WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response
Tianrong Zhang, Bochuan Cao, Yuanpu Cao, Lu Lin, Prasenjit Mitra, Jinghui Chen
AAML · 22 May 2024

Safety Alignment for Vision Language Models
Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng
VLM, MLLM · 22 May 2024

How to Trace Latent Generative Model Generated Images without Artificial Watermark?
Zhenting Wang, Vikash Sehwag, Chen Chen, Lingjuan Lyu, Dimitris N. Metaxas, Shiqing Ma
WIGM · 22 May 2024

Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal, Apratim De, Yiting He, Yiqiao Zhong, Junjie Hu
22 May 2024

Securing the Future of GenAI: Policy and Technology
Mihai Christodorescu, Craven, S. Feizi, Neil Zhenqiang Gong, Mia Hoffmann, ..., Jessica Newman, Emelia Probasco, Yanjun Qi, Khawaja Shams, Turek
SILM · 21 May 2024

GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
Govind Ramesh, Yao Dou, Wei Xu
PILM · 21 May 2024

Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, Kailong Wang
20 May 2024

Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors
Jiachen Sun, Changsheng Wang, Jiong Wang, Yiwei Zhang, Chaowei Xiao
AAML, VLM · 17 May 2024

What is it for a Machine Learning Model to Have a Capability?
Jacqueline Harding, Nathaniel Sharadin
ELM · 14 May 2024

Risks and Opportunities of Open-Source Generative AI
Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, ..., Matthew Jackson, Philip H. S. Torr, Trevor Darrell, Y. Lee, Jakob N. Foerster
14 May 2024

SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
Raghuveer Peri, Sai Muralidhar Jayanthi, S. Ronanki, Anshu Bhatia, Karel Mundnich, ..., Srikanth Vishnubhotla, Daniel Garcia-Romero, S. Srinivasan, Kyu J. Han, Katrin Kirchhoff
AAML · 14 May 2024

PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition
Ziyang Zhang, Qizhen Zhang, Jakob N. Foerster
AAML · 13 May 2024

PLeak: Prompt Leaking Attacks against Large Language Model Applications
Bo Hui, Haolin Yuan, Neil Gong, Philippe Burlina, Yinzhi Cao
LLMAG, AAML, SILM · 10 May 2024

LLM-Generated Black-box Explanations Can Be Adversarially Helpful
R. Ajwani, Shashidhar Reddy Javaji, Frank Rudzicz, Zining Zhu
AAML · 10 May 2024

Revisiting character-level adversarial attacks
Elias Abad Rocamora, Yongtao Wu, Fanghui Liu, Grigorios G. Chrysos, V. Cevher
AAML · 07 May 2024

Beyond human subjectivity and error: a novel AI grading system
Alexandra Gobrecht, Felix Tuma, Moritz Möller, Thomas Zöller, Mark Zakhvatkin, Alexandra Wuttig, Holger Sommerfeldt, Sven Schütt
07 May 2024

A Causal Explainable Guardrails for Large Language Models
Zhixuan Chu, Yan Wang, Longfei Li, Zhibo Wang, Zhan Qin, Kui Ren
LLMSV · 07 May 2024

Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
Shang Shang, Xinqiang Zhao, Zhongjiang Yao, Yepeng Yao, Liya Su, Zijing Fan, Xiaodan Zhang, Zhengwei Jiang
06 May 2024

When LLMs Meet Cybersecurity: A Systematic Literature Review
Jie Zhang, Haoyu Bu, Hui Wen, Yu Chen, Lun Li, Hongsong Zhu
06 May 2024

PICLe: Eliciting Diverse Behaviors from Large Language Models with Persona In-Context Learning
Hyeong Kyu Choi, Yixuan Li
03 May 2024

ProFLingo: A Fingerprinting-based Intellectual Property Protection Scheme for Large Language Models
Heng Jin, Chaoyu Zhang, Shanghao Shi, W. Lou, Y. T. Hou
03 May 2024

Generative AI in Cybersecurity
Shivani Metta, Isaac Chang, Jack Parker, Michael P. Roman, Arturo F. Ehuan
02 May 2024

Evaluating and Mitigating Linguistic Discrimination in Large Language Models
Guoliang Dong, Haoyu Wang, Jun Sun, Xinyu Wang
29 Apr 2024

Towards Incremental Learning in Large Language Models: A Critical Review
M. Jovanovic, Peter Voss
ELM, CLL, KELM · 28 Apr 2024

Exploring the Robustness of In-Context Learning with Noisy Labels
Chen Cheng, Xinzhi Yu, Haodong Wen, Jinsong Sun, Guanzhang Yue, Yihao Zhang, Zeming Wei
NoLa · 28 Apr 2024

Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo
Stephen Zhao, Rob Brekelmans, Alireza Makhzani, Roger C. Grosse
26 Apr 2024

Human-Imperceptible Retrieval Poisoning Attacks in LLM-Powered Applications
Quan Zhang, Binqi Zeng, Chijin Zhou, Gwihwan Go, Heyuan Shi, Yu Jiang
SILM, AAML · 26 Apr 2024

Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs
Valeriia Cherepanova, James Zou
AAML · 26 Apr 2024

Near to Mid-term Risks and Opportunities of Open-Source Generative AI
Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, ..., Paul Röttger, Philip H. S. Torr, Trevor Darrell, Y. Lee, Jakob N. Foerster
25 Apr 2024

Don't Say No: Jailbreaking LLM by Suppressing Refusal
Yukai Zhou, Wenjie Wang
AAML · 25 Apr 2024

VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations
Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Shama Sastry, E. Milios, Sageev Oore, Hassan Sajjad
VLM, CoGe · 25 Apr 2024

XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
João Monteiro, Étienne Marcotte, Pierre-Andre Noel, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
23 Apr 2024

Rethinking LLM Memorization through the Lens of Adversarial Compression
Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary Chase Lipton, J. Zico Kolter
23 Apr 2024

A Survey on the Real Power of ChatGPT
Ming Liu, Ran Liu, Ye Zhu, Hua Wang, Youyang Qu, Rongsheng Li, Yongpan Sheng, Wray L. Buntine
22 Apr 2024

Holistic Safety and Responsibility Evaluations of Advanced AI Models
Laura Weidinger, Joslyn Barnhart, Jenny Brennan, Christina Butterfield, Susie Young, ..., Sebastian Farquhar, Lewis Ho, Iason Gabriel, Allan Dafoe, William S. Isaac
ELM · 22 Apr 2024

Protecting Your LLMs with Information Bottleneck
Zichuan Liu, Zefan Wang, Linjie Xu, Jinyu Wang, Lei Song, Tianchun Wang, Chunlin Chen, Wei Cheng, Jiang Bian
KELM, AAML · 22 Apr 2024

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, F. Tramèr
22 Apr 2024

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian
AAML · 21 Apr 2024

Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
Narek Maloyan, Ekansh Verma, Bulat Nutfullin, Bislan Ashinov
21 Apr 2024

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace, Kai Y. Xiao, R. Leike, Lilian Weng, Johannes Heidecke, Alex Beutel
SILM · 19 Apr 2024

Advancing the Robustness of Large Language Models through Self-Denoised Smoothing
Jiabao Ji, Bairu Hou, Zhen Zhang, Guanhua Zhang, Wenqi Fan, Qing Li, Yang Zhang, Gaowen Liu, Sijia Liu, Shiyu Chang
AAML · 18 Apr 2024

Uncovering Safety Risks of Large Language Models through Concept Activation Vector
Zhihao Xu, Ruixuan Huang, Changyu Chen, Shuai Wang, Xiting Wang
LLMSV · 18 Apr 2024

Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations
Christian Tomani, Kamalika Chaudhuri, Ivan Evtimov, Daniel Cremers, Mark Ibrahim
16 Apr 2024

Forcing Diffuse Distributions out of Language Models
Yiming Zhang, Avi Schwarzschild, Nicholas Carlini, Zico Kolter, Daphne Ippolito
ALM, DiffM · 16 Apr 2024

Private Attribute Inference from Images with Vision-Language Models
Batuhan Tömekçe, Mark Vero, Robin Staab, Martin Vechev
VLM, PILM · 16 Apr 2024

Learn Your Reference Model for Real Good Alignment
Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov
OffRL · 15 Apr 2024

Adversarial Robustness Limits via Scaling-Law and Human-Alignment Studies
Brian Bartoldson, James Diffenderfer, Konstantinos Parasyris, B. Kailkhura
AAML · 14 Apr 2024

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Minfeng Zhu, Wei Zhang, Wei Chen
12 Apr 2024

Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts
Tianyu Zhang, Zixuan Zhao, Jiaqi Huang, Jingyu Hua, Sheng Zhong
12 Apr 2024

LLM Agents can Autonomously Exploit One-day Vulnerabilities
Richard Fang, R. Bindu, Akul Gupta, Daniel Kang
SILM, LLMAG · 11 Apr 2024