Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, Jacob Steinhardt
arXiv:2307.02483 · 5 July 2023
Papers citing "Jailbroken: How Does LLM Safety Training Fail?" (50 / 634 papers shown)
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
Haoran Gu
Handing Wang
Yi Mei
Mengjie Zhang
Yaochu Jin
12 May 2025
Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks
Yixin Cheng
Hongcheng Guo
Yangming Li
Leonid Sigal
AAML
WaLM
08 May 2025
Stealthy LLM-Driven Data Poisoning Attacks Against Embedding-Based Retrieval-Augmented Recommender Systems
Fatemeh Nazary
Yashar Deldjoo
T. D. Noia
E. Sciascio
AAML
SILM
08 May 2025
Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety
Variath Madhupal Gautham Nair
Vishal Varma Dantuluri
07 May 2025
Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents
Christian Schroeder de Witt
AAML
AI4CE
04 May 2025
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs
Haoming Yang
Ke Ma
X. Jia
Yingfei Sun
Qianqian Xu
Q. Huang
AAML
03 May 2025
Attack and defense techniques in large language models: A survey and new perspectives
Zhiyu Liao
Kang Chen
Yuanguo Lin
Kangkang Li
Yunxuan Liu
Hefeng Chen
Xingwang Huang
Yuanhui Yu
AAML
02 May 2025
LLM Security: Vulnerabilities, Attacks, Defenses, and Countermeasures
Francisco Aguilera-Martínez
Fernando Berzal
PILM
02 May 2025
Transferable Adversarial Attacks on Black-Box Vision-Language Models
Kai Hu
Weichen Yu
L. Zhang
Alexander Robey
Andy Zou
Chengming Xu
Haoqi Hu
Matt Fredrikson
AAML
VLM
02 May 2025
XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs
Marco Arazzi
Vignesh Kumar Kembu
Antonino Nocera
V. P.
30 Apr 2025
JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift
Julien Piet
Xiao Huang
Dennis Jacob
Annabella Chow
Maha Alrashed
Geng Zhao
Zhanhao Hu
Chawin Sitawarin
Basel Alomair
David A. Wagner
AAML
28 Apr 2025
Modular Machine Learning: An Indispensable Path towards New-Generation Large Language Models
X. Wang
Haoyang Li
Zeyang Zhang
H. Chen
Wenwu Zhu
LRM
28 Apr 2025
A Cryptographic Perspective on Mitigation vs. Detection in Machine Learning
Greg Gluch
Shafi Goldwasser
AAML
28 Apr 2025
CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges
Y. Li
Qizhi Pei
Mengyuan Sun
Honglin Lin
Chenlin Ming
Xin Gao
Jiang Wu
C. He
Lijun Wu
ELM
LRM
27 Apr 2025
Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors
Ren-Wei Liang
Chin-Ting Hsu
Chan-Hung Yu
Saransh Agrawal
Shih-Cheng Huang
Shang-Tse Chen
Kuan-Hao Huang
Shao-Hua Sun
27 Apr 2025
Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs
Mohammad Akbar-Tajari
Mohammad Taher Pilehvar
Mohammad Mahmoody
AAML
26 Apr 2025
RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models
Bang An
Shiyue Zhang
Mark Dredze
25 Apr 2025
Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections
Narek Maloyan
Dmitry Namiot
SILM
AAML
ELM
25 Apr 2025
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey
David E. Evans
LLMSV
23 Apr 2025
BELL: Benchmarking the Explainability of Large Language Models
Syed Quiser Ahmed
Bharathi Vokkaliga Ganesh
Jagadish Babu P
Karthick Selvaraj
ReddySiva Naga Parvathi Devi
Sravya Kappala
ELM
22 Apr 2025
RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search
Quy-Anh Dang
Chris Ngo
Truong Son-Hy
AAML
SyDa
21 Apr 2025
OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation
Yichen Wu
Xudong Pan
Geng Hong
Min Yang
LLMAG
18 Apr 2025
DoomArena: A framework for Testing AI Agents Against Evolving Security Threats
Léo Boisvert
Mihir Bansal
Chandra Kiran Reddy Evuru
Gabriel Huang
Abhay Puri
...
Quentin Cappart
Jason Stanley
Alexandre Lacoste
Alexandre Drouin
Krishnamurthy Dvijotham
18 Apr 2025
DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification
Yu Li
Han Jiang
Zhihua Wei
AAML
18 Apr 2025
GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms
Sinan He
An Wang
17 Apr 2025
AI Safety Should Prioritize the Future of Work
Sanchaita Hazra
Bodhisattwa Prasad Majumder
Tuhin Chakrabarty
16 Apr 2025
The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
Kristina Nikolić
Luze Sun
Jie Zhang
F. Tramèr
14 Apr 2025
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models?
Yanbo Wang
Jiyang Guan
Jian Liang
Ran He
14 Apr 2025
CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent
Liang-bo Ning
Shijie Wang
Wenqi Fan
Qing Li
Xin Xu
Hao Chen
Feiran Huang
AAML
13 Apr 2025
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao
Jiahe Guo
Yulin Hu
Yang Deng
An Zhang
...
Xinyang Han
Yanyan Zhao
Bing Qin
Tat-Seng Chua
Ting Liu
AAML
LLMSV
13 Apr 2025
The Structural Safety Generalization Problem
Julius Broomfield
Tom Gibbs
Ethan Kosak-Hine
George Ingebretsen
Tia Nasir
Jason Zhang
Reihaneh Iranmanesh
Sara Pieri
Reihaneh Rabbany
Kellin Pelrine
AAML
13 Apr 2025
X-Guard: Multilingual Guard Agent for Content Moderation
Bibek Upadhayay
Vahid Behzadan, Ph.D.
11 Apr 2025
Geneshift: Impact of different scenario shift on Jailbreaking LLM
Tianyi Wu
Zhiwei Xue
Yue Liu
Jiaheng Zhang
Bryan Hooi
See-Kiong Ng
10 Apr 2025
AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks
Charlotte Siska
Anush Sankaran
AAML
10 Apr 2025
Bypassing Safety Guardrails in LLMs Using Humor
Pedro Cisneros-Velarde
09 Apr 2025
Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking
Yu-Hang Wu
Yu-Jie Xiong
Jie Zhang
AAML
08 Apr 2025
A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models
Carlos Peláez-González
Andrés Herrera-Poyatos
Cristina Zuheros
David Herrera-Poyatos
Virilo Tejedor
F. Herrera
AAML
07 Apr 2025
Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs
Ling Hu
Yuemei Xu
Xiaoyang Gu
Letao Han
07 Apr 2025
StyleRec: A Benchmark Dataset for Prompt Recovery in Writing Style Transformation
Shenyang Liu
Yang Gao
Shaoyan Zhai
Liqiang Wang
06 Apr 2025
Rethinking Reflection in Pre-Training
Essential AI
Darsh J Shah
Peter Rushton
Somanshu Singla
Mohit Parmar
...
Philip Monk
Platon Mazarakis
Ritvik Kapila
Saurabh Srivastava
Tim Romanski
ReLM
LRM
05 Apr 2025
On the Connection Between Diffusion Models and Molecular Dynamics
Liam Harcombe
Timothy T. Duignan
DiffM
04 Apr 2025
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
Yifan Wang
Runjin Chen
Bolian Li
David Cho
Yihe Deng
Ruqi Zhang
Tianlong Chen
Zhangyang Wang
A. Grama
Junyuan Hong
SyDa
03 Apr 2025
Representation Bending for Large Language Model Safety
Ashkan Yousefpour
Taeheon Kim
Ryan S. Kwon
Seungbeen Lee
Wonje Jeung
Seungju Han
Alvin Wan
Harrison Ngan
Youngjae Yu
Jonghyun Choi
AAML
ALM
KELM
02 Apr 2025
PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization
Aofan Liu
Lulu Tang
Ting Pan
Yuguo Yin
Bin Wang
Ao Yang
MLLM
AAML
02 Apr 2025
Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning
S. Chen
Xiao Yu
Ninareh Mehrabi
Rahul Gupta
Zhou Yu
Ruoxi Jia
AAML
LLMAG
02 Apr 2025
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Jiawei Wang
Yushen Zuo
Yuanjun Chai
Z. Liu
Yichen Fu
Yichun Feng
Kin-Man Lam
AAML
VLM
02 Apr 2025
Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses
Zhengchun Shang
Wenlan Wei
AAML
02 Apr 2025
Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
Rana Muhammad Shahroz Khan
Zhen Tan
Sukwon Yun
Charles Flemming
Tianlong Chen
AAML
LLMAG
Presented at ResearchTrend Connect | LLMAG on 23 Apr 2025
31 Mar 2025
Encrypted Prompt: Securing LLM Applications Against Unauthorized Actions
Shih-Han Chan
AAML
29 Mar 2025
FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models
Dahyun Jung
Seungyoon Lee
Hyeonseok Moon
Chanjun Park
Heuiseok Lim
AAML
ALM
ELM
25 Mar 2025