Universal and Transferable Adversarial Attacks on Aligned Language Models
arXiv:2307.15043 · 27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models" (50 of 937 shown)

| Title | Authors | Tags | Counts | Date |
|---|---|---|---|---|
| Adversarial Suffix Filtering: a Defense Pipeline for LLMs | David Khachaturov, Robert D. Mullins | AAML | 2 / 0 / 0 | 14 May 2025 |
| Layered Unlearning for Adversarial Relearning | Timothy Qian, Vinith M. Suriyakumar, Ashia C. Wilson, Dylan Hadfield-Menell | MU | 12 / 0 / 0 | 14 May 2025 |
| SecReEvalBench: A Multi-turned Security Resilience Evaluation Benchmark for Large Language Models | Huining Cui, Wei Liu | AAML, ELM | 23 / 0 / 0 | 12 May 2025 |
| One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models | Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin | | 20 / 0 / 0 | 12 May 2025 |
| System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection | Jiawei Guo, Haipeng Cai | SILM, AAML | 18 / 0 / 0 | 10 May 2025 |
| POISONCRAFT: Practical Poisoning of Retrieval-Augmented Generation for Large Language Models | Yangguang Shao, Xinjie Lin, Haozheng Luo, Chengshang Hou, G. Xiong, J. Yu, Junzheng Shi | SILM | 42 / 0 / 0 | 10 May 2025 |
| LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities | Kalyan Nakka, Jimmy Dani, Ausmit Mondal, Nitesh Saxena | AAML | 25 / 0 / 0 | 08 May 2025 |
| RAP-SM: Robust Adversarial Prompt via Shadow Models for Copyright Verification of Large Language Models | Zhenhua Xu, Zhebo Wang, Maike Li, Wenpeng Xing, Chunqiang Hu, Chen Zhi, Meng Han | AAML | 14 / 0 / 0 | 08 May 2025 |
| A Proposal for Evaluating the Operational Risk for ChatBots based on Large Language Models | Pedro Pinacho-Davidson, Fernando Gutierrez, Pablo Zapata, Rodolfo Vergara, Pablo Aqueveque | SILM | 48 / 0 / 0 | 07 May 2025 |
| Adversarial Attacks in Multimodal Systems: A Practitioner's Survey | Shashank Kapoor, Sanjay Surendranath Girija, Lakshit Arora, Dipen Pradhan, Ankit Shetgaonkar, Aman Raj | AAML | 65 / 0 / 0 | 06 May 2025 |
| Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents | Christian Schroeder de Witt | AAML, AI4CE | 86 / 0 / 0 | 04 May 2025 |
| Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs | Haoming Yang, Ke Ma, X. Jia, Yingfei Sun, Qianqian Xu, Q. Huang | AAML | 92 / 0 / 0 | 03 May 2025 |
| Attack and defense techniques in large language models: A survey and new perspectives | Zhiyu Liao, Kang Chen, Yuanguo Lin, Kangkang Li, Yunxuan Liu, Hefeng Chen, Xingwang Huang, Yuanhui Yu | AAML | 54 / 0 / 0 | 02 May 2025 |
| Triggering Hallucinations in LLMs: A Quantitative Study of Prompt-Induced Hallucination in Large Language Models | Makoto Sato | HILM, LRM | 20 / 0 / 0 | 01 May 2025 |
| Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs | Pan Suo, Yu-ming Shang, San-Chuan Guo, Xi Zhang | SILM, AAML | 45 / 0 / 0 | 30 Apr 2025 |
| Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction | Y. Chen, Haoran Li, Yuan Sui, Y. Liu, Yufei He, Y. Song, Bryan Hooi | AAML, SILM | 61 / 0 / 0 | 29 Apr 2025 |
| NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models | Yi Zhou, Wenpeng Xing, Dezhang Kong, Changting Lin, Meng Han | MU, KELM, LLMSV | 45 / 0 / 0 | 29 Apr 2025 |
| A Cryptographic Perspective on Mitigation vs. Detection in Machine Learning | Greg Gluch, Shafi Goldwasser | AAML | 37 / 0 / 0 | 28 Apr 2025 |
| JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift | Julien Piet, Xiao Huang, Dennis Jacob, Annabella Chow, Maha Alrashed, Geng Zhao, Zhanhao Hu, Chawin Sitawarin, Basel Alomair, David A. Wagner | AAML | 63 / 0 / 0 | 28 Apr 2025 |
| Prompt Injection Attack to Tool Selection in LLM Agents | Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun | LLMAG | 51 / 0 / 0 | 28 Apr 2025 |
| Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary | Yakai Li, Jiekang Hu, Weiduan Sang, Luping Ma, Jing Xie, Weijuan Zhang, Aimin Yu, Shijie Zhao, Qingjia Huang, Qihang Zhou | AAML | 52 / 0 / 0 | 28 Apr 2025 |
| Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs | Mohammad Akbar-Tajari, Mohammad Taher Pilehvar, Mohammad Mahmoody | AAML | 46 / 0 / 0 | 26 Apr 2025 |
| Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks | Yixin Cao, Shibo Hong, X. Li, Jiahao Ying, Yubo Ma, ..., Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, Yu Jiang | ALM, ELM | 84 / 0 / 0 | 26 Apr 2025 |
| DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models | J. Liu, Hangyu Guo, Ranjie Duan, Xingyuan Bu, Yancheng He, ..., Yingshui Tan, Yanan Wu, Jihao Gu, Y. Li, J. Zhu | MLLM | 97 / 0 / 0 | 25 Apr 2025 |
| RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models | Bang An, Shiyue Zhang, Mark Dredze | | 54 / 0 / 0 | 25 Apr 2025 |
| Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections | Narek Maloyan, Dmitry Namiot | SILM, AAML, ELM | 75 / 0 / 0 | 25 Apr 2025 |
| NoEsis: Differentially Private Knowledge Transfer in Modular LLM Adaptation | Rob Romijnders, Stefanos Laskaridis, Ali Shahin Shamsabadi, Hamed Haddadi | | 57 / 0 / 0 | 25 Apr 2025 |
| Safety Pretraining: Toward the Next Generation of Safe AI | Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Zachary C. Lipton, J. Zico Kolter | | 50 / 0 / 0 | 23 Apr 2025 |
| Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control | Hannah Cyberey, David E. Evans | LLMSV | 74 / 0 / 0 | 23 Apr 2025 |
| WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks | Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, Kamalika Chaudhuri | AAML | 33 / 0 / 0 | 22 Apr 2025 |
| DualBreach: Efficient Dual-Jailbreaking via Target-Driven Initialization and Multi-Target Optimization | Xinzhe Huang, Kedong Xiu, T. Zheng, Churui Zeng, Wangze Ni, Zhan Qin, K. Ren, C. L. P. Chen | AAML | 28 / 0 / 0 | 21 Apr 2025 |
| Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models | Tri Nguyen, Lohith Srikanth Pentapalli, Magnus Sieverding, Laurah Turner, Seth Overla, ..., Michael Gharib, Matt Kelleher, Michael Shukis, Cameron Pawlik, Kelly Cohen | | 51 / 0 / 0 | 21 Apr 2025 |
| Manipulating Multimodal Agents via Cross-Modal Prompt Injection | Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, A. Liu, Xianglong Liu | AAML | 31 / 1 / 0 | 19 Apr 2025 |
| DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification | Yu Li, Han Jiang, Zhihua Wei | AAML | 29 / 0 / 0 | 18 Apr 2025 |
| Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models | Yule Liu, Jingyi Zheng, Zhen Sun, Zifan Peng, Wenhan Dong, Zeyang Sha, Shiwen Cui, Weiqiang Wang, Xinlei He | OffRL, LRM | 36 / 3 / 0 | 18 Apr 2025 |
| Q-FAKER: Query-free Hard Black-box Attack via Controlled Generation | CheolWon Na, YunSeok Choi, Jee-Hyong Lee | AAML | 37 / 0 / 0 | 18 Apr 2025 |
| Antidistillation Sampling | Yash Savani, Asher Trockman, Zhili Feng, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter | | 44 / 0 / 0 | 17 Apr 2025 |
| ELAB: Extensive LLM Alignment Benchmark in Persian Language | Zahra Pourbahman, Fatemeh Rajabi, Mohammadhossein Sadeghi, Omid Ghahroodi, Somaye Bakhshaei, Arash Amini, Reza Kazemi, M. Baghshah | | 27 / 0 / 0 | 17 Apr 2025 |
| REWARD CONSISTENCY: Improving Multi-Objective Alignment from a Data-Centric Perspective | Zhihao Xu, Yongqi Tong, Xin Zhang, Jun Zhou, Xiting Wang | | 35 / 0 / 0 | 15 Apr 2025 |
| DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks | Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong | AAML | 34 / 0 / 0 | 15 Apr 2025 |
| StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models | Yang Feng, Xudong Pan | AAML | 31 / 0 / 0 | 14 Apr 2025 |
| Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? | Yanbo Wang, Jiyang Guan, Jian Liang, Ran He | | 43 / 0 / 0 | 14 Apr 2025 |
| Ctrl-Z: Controlling AI Agents via Resampling | Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan, B. S. | AAML | 30 / 1 / 0 | 14 Apr 2025 |
| The Jailbreak Tax: How Useful are Your Jailbreak Outputs? | Kristina Nikolić, Luze Sun, Jie Zhang, F. Tramèr | | 23 / 0 / 0 | 14 Apr 2025 |
| CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent | Liang-bo Ning, Shijie Wang, Wenqi Fan, Qing Li, Xin Xu, Hao Chen, Feiran Huang | AAML | 21 / 16 / 0 | 13 Apr 2025 |
| The Structural Safety Generalization Problem | Julius Broomfield, Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Tia Nasir, Jason Zhang, Reihaneh Iranmanesh, Sara Pieri, Reihaneh Rabbany, Kellin Pelrine | AAML | 23 / 0 / 0 | 13 Apr 2025 |
| MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? | Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang | LLMAG, ELM | 62 / 0 / 0 | 13 Apr 2025 |
| AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender | Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, ..., Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu | AAML, LLMSV | 41 / 0 / 0 | 13 Apr 2025 |
| Detecting Instruction Fine-tuning Attack on Language Models with Influence Function | Jiawei Li | TDI, AAML | 33 / 0 / 0 | 12 Apr 2025 |
| Feature-Aware Malicious Output Detection and Mitigation | Weilong Dong, Peiguang Li, Yu Tian, Xinyi Zeng, Fengdi Li, Sirui Wang | AAML | 24 / 0 / 0 | 12 Apr 2025 |