arXiv:2307.15043
Universal and Transferable Adversarial Attacks on Aligned Language Models
27 July 2023
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models" (50 of 938 papers shown)
Feature-Aware Malicious Output Detection and Mitigation
Weilong Dong
Peiguang Li
Yu Tian
Xinyi Zeng
Fengdi Li
Sirui Wang
AAML
12 Apr 2025
AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks
Charlotte Siska
Anush Sankaran
AAML
10 Apr 2025
Geneshift: Impact of different scenario shift on Jailbreaking LLM
Tianyi Wu
Zhiwei Xue
Yue Liu
Jiaheng Zhang
Bryan Hooi
See-Kiong Ng
10 Apr 2025
PR-Attack: Coordinated Prompt-RAG Attacks on Retrieval-Augmented Generation in Large Language Models via Bilevel Optimization
Yang Jiao
X. Wang
Kai Yang
AAML
SILM
10 Apr 2025
Bridging the Gap Between Preference Alignment and Machine Unlearning
Xiaohua Feng
Yuyuan Li
Huwei Ji
Jiaming Zhang
L. Zhang
Tianyu Du
Chaochao Chen
MU
09 Apr 2025
NLP Security and Ethics, in the Wild
Heather Lent
Erick Galinkin
Yiyi Chen
Jens Myrup Pedersen
Leon Derczynski
Johannes Bjerva
SILM
09 Apr 2025
Bypassing Safety Guardrails in LLMs Using Humor
Pedro Cisneros-Velarde
09 Apr 2025
Separator Injection Attack: Uncovering Dialogue Biases in Large Language Models Caused by Role Separators
Xitao Li
H. Wang
Jiang Wu
Ting Liu
AAML
08 Apr 2025
StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization
Yiming Tang
Yi Fan
Chenxiao Yu
Tiankai Yang
Yue Zhao
Xiyang Hu
08 Apr 2025
A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models
Carlos Peláez-González
Andrés Herrera-Poyatos
Cristina Zuheros
David Herrera-Poyatos
Virilo Tejedor
F. Herrera
AAML
07 Apr 2025
Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models
Jiawei Lian
Jianhong Pan
L. Wang
Yi Wang
Shaohui Mei
Lap-Pui Chau
AAML
07 Apr 2025
Rethinking Reflection in Pre-Training
Essential AI
Darsh J Shah
Peter Rushton
Somanshu Singla
Mohit Parmar
...
Philip Monk
Platon Mazarakis
Ritvik Kapila
Saurabh Srivastava
Tim Romanski
ReLM
LRM
05 Apr 2025
Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability
Vishnu Kabir Chhabra
Mohammad Mahdi Khalili
AI4CE
05 Apr 2025
Exploiting Fine-Grained Skip Behaviors for Micro-Video Recommendation
Sanghyuck Lee
Sangkeun Park
Jaesung Lee
04 Apr 2025
The H-Elena Trojan Virus to Infect Model Weights: A Wake-Up Call on the Security Risks of Malicious Fine-Tuning
Virilo Tejedor
Cristina Zuheros
Carlos Peláez-González
David Herrera-Poyatos
Andrés Herrera-Poyatos
F. Herrera
04 Apr 2025
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
Yifan Wang
Runjin Chen
Bolian Li
David Cho
Yihe Deng
Ruqi Zhang
Tianlong Chen
Zhangyang Wang
A. Grama
Junyuan Hong
SyDa
03 Apr 2025
PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization
Aofan Liu
Lulu Tang
Ting Pan
Yuguo Yin
Bin Wang
Ao Yang
MLLM
AAML
02 Apr 2025
Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning
S. Chen
Xiao Yu
Ninareh Mehrabi
Rahul Gupta
Zhou Yu
Ruoxi Jia
AAML
LLMAG
02 Apr 2025
Emerging Cyber Attack Risks of Medical AI Agents
Jianing Qiu
Lin Li
Jiankai Sun
Hao Wei
Zhe Xu
K. Lam
Wu Yuan
AAML
02 Apr 2025
Representation Bending for Large Language Model Safety
Ashkan Yousefpour
Taeheon Kim
Ryan S. Kwon
Seungbeen Lee
Wonje Jeung
Seungju Han
Alvin Wan
Harrison Ngan
Youngjae Yu
Jonghyun Choi
AAML
ALM
KELM
02 Apr 2025
CyberBOT: Towards Reliable Cybersecurity Education via Ontology-Grounded Retrieval Augmented Generation
Chengshuai Zhao
Riccardo De Maria
Tharindu Kumarage
Kumar Satvik Chaudhary
Garima Agrawal
Yiwen Li
Jongchan Park
Yuli Deng
Y. Chen
Huan Liu
01 Apr 2025
Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots
Erfan Shayegani
G M Shahariar
Sara Abdali
Lei Yu
Nael B. Abu-Ghazaleh
Yue Dong
AAML
01 Apr 2025
Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
Rana Muhammad Shahroz Khan
Zhen Tan
Sukwon Yun
Charles Flemming
Tianlong Chen
AAML
LLMAG
Presented at ResearchTrend Connect | LLMAG on 23 Apr 2025
31 Mar 2025
Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms
Shuoming Zhang
Jiacheng Zhao
Ruiyuan Xu
Xiaobing Feng
Huimin Cui
AAML
31 Mar 2025
Get the Agents Drunk: Memory Perturbations in Autonomous Agent-based Recommender Systems
Shiyi Yang
Z. Hu
Chen Wang
Tong Yu
Xiwei Xu
Liming Zhu
Lina Yao
AAML
31 Mar 2025
Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models
Runpeng Dai
Run Yang
Fan Zhou
Hongtu Zhu
28 Mar 2025
ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
Chung-En Sun
Ge Yan
Tsui-Wei Weng
KELM
LRM
27 Mar 2025
Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models
Shih-Wen Ke
Guan-Yu Lai
Guo-Lin Fang
Hsi-Yuan Kao
SILM
26 Mar 2025
Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
Joonhyun Jeong
Seyun Bae
Yeonsung Jung
Jaeryong Hwang
Eunho Yang
AAML
26 Mar 2025
Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation
Zhiyao Ren
Yibing Zhan
B. Yu
Dacheng Tao
DiffM
25 Mar 2025
OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad
Luyao Tang
Yuxuan Yuan
C. L. P. Chen
Zeyu Zhang
Yue Huang
Kun Zhang
24 Mar 2025
Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models
Jiaming Ji
X. Chen
Rui Pan
Han Zhu
C. Zhang
...
Juntao Dai
Chi-Min Chan
Sirui Han
Yike Guo
Y. Yang
OffRL
22 Mar 2025
A Survey on Personalized Alignment -- The Missing Piece for Large Language Models in Real-World Applications
Jian-Yu Guan
J. Wu
J. Li
Chuanqi Cheng
Wei Yu Wu
LM&MA
21 Mar 2025
Towards LLM Guardrails via Sparse Representation Steering
Zeqing He
Zhibo Wang
Huiyu Xu
Kui Ren
LLMSV
21 Mar 2025
SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
Aladin Djuhera
S. Kadhe
Farhan Ahmed
Syed Zawad
Holger Boche
MoMe
21 Mar 2025
In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI
Shayne Longpre
Kevin Klyman
Ruth E. Appel
Sayash Kapoor
Rishi Bommasani
...
Victoria Westerhoff
Yacine Jernite
Rumman Chowdhury
Percy Liang
Arvind Narayanan
ELM
21 Mar 2025
How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities
Aly M. Kassem
Bernhard Schölkopf
Zhijing Jin
20 Mar 2025
Detecting LLM-Written Peer Reviews
Vishisht Rao
Aounon Kumar
Himabindu Lakkaraju
Nihar B. Shah
DeLMO
AAML
20 Mar 2025
AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration
Andy Zhou
Kevin E. Wu
Francesco Pinto
Z. Chen
Yi Zeng
Yu Yang
Shuang Yang
Sanmi Koyejo
James Zou
Bo Li
LLMAG
AAML
20 Mar 2025
Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
Zonghao Ying
Guangyi Zheng
Yongxin Huang
Deyue Zhang
Wenxin Zhang
Quanchen Zou
Aishan Liu
X. Liu
Dacheng Tao
ELM
19 Mar 2025
AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations
Dillon Bowen
Ann-Kathrin Dombrowski
Adam Gleave
Chris Cundy
ELM
17 Mar 2025
Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents
Juhee Kim
Woohyuk Choi
Byoungyoung Lee
LLMAG
17 Mar 2025
Augmented Adversarial Trigger Learning
Zhe Wang
Yanjun Qi
16 Mar 2025
Empirical Privacy Variance
Yuzheng Hu
Fan Wu
Ruicheng Xian
Yuhang Liu
Lydia Zakynthinou
Pritish Kamath
Chiyuan Zhang
David A. Forsyth
16 Mar 2025
Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning
Yiwei Chen
Yuguang Yao
Yihua Zhang
Bingquan Shen
Gaowen Liu
Sijia Liu
AAML
MU
14 Mar 2025
Safe Vision-Language Models via Unsafe Weights Manipulation
Moreno D'Incà
E. Peruzzo
Xingqian Xu
Humphrey Shi
N. Sebe
Massimiliano Mancini
MU
14 Mar 2025
Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search
Andy Zhou
MU
13 Mar 2025
Rethinking Prompt-based Debiasing in Large Language Models
Xinyi Yang
Runzhe Zhan
Derek F. Wong
Shu Yang
Junchao Wu
Lidia S. Chao
ALM
12 Mar 2025
Generating Robot Constitutions & Benchmarks for Semantic Safety
P. Sermanet
Anirudha Majumdar
A. Irpan
Dmitry Kalashnikov
Vikas Sindhwani
LM&Ro
11 Mar 2025
Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation
Wenlong Meng
Fan Zhang
Wendao Yao
Zhenyuan Guo
Y. Li
Chengkun Wei
Wenzhi Chen
AAML
11 Mar 2025