Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2309.00254
Cited By
Why do universal adversarial attacks work on large language models?: Geometry might be the answer
1 September 2023
Varshini Subhash
Anna Bialas
Weiwei Pan
Finale Doshi-Velez
AAML
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Why do universal adversarial attacks work on large language models?: Geometry might be the answer"
9 / 9 papers shown
Title
RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process
Peiran Wang
Xiaogeng Liu
Chaowei Xiao
AAML
31
3
0
11 Oct 2024
Defining and Evaluating Decision and Composite Risk in Language Models Applied to Natural Language Inference
Ke Shen
Mayank Kejriwal
37
0
0
04 Aug 2024
Can Large Language Models Automatically Jailbreak GPT-4V?
Yuanwei Wu
Yue Huang
Yixin Liu
Xiang Li
Pan Zhou
Lichao Sun
SILM
37
1
0
23 Jul 2024
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li
Hangyu Guo
Kun Zhou
Wayne Xin Zhao
Ji-Rong Wen
61
39
0
14 Mar 2024
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
Yifan Zeng
Yiran Wu
Xiao Zhang
Huazheng Wang
Qingyun Wu
LLMAG
AAML
42
61
0
02 Mar 2024
A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models
Zihao Xu
Yi Liu
Gelei Deng
Yuekang Li
S. Picek
PILM
AAML
33
35
0
21 Feb 2024
Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks
Haz Sameen Shahgir
Xianghao Kong
Greg Ver Steeg
Yue Dong
16
5
0
22 Dec 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
212
496
0
01 Nov 2022
MINIMAL: Mining Models for Data Free Universal Adversarial Triggers
Swapnil Parekh
Yaman Kumar Singla
Somesh Singh
Changyou Chen
Balaji Krishnamurthy
R. Shah
AAML
16
3
0
25 Sep 2021
1