"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
arXiv:2308.03825 · 7 August 2023 · SILM
Xinyue Shen, Z. Chen, Michael Backes, Yun Shen, Yang Zhang

Papers citing ""Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models" (50 of 178 papers shown)

Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal, Apratim De, Yiting He, Yiqiao Zhong, Junjie Hu · 22 May 2024

Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
Shang Shang, Xinqiang Zhao, Zhongjiang Yao, Yepeng Yao, Liya Su, Zijing Fan, Xiaodan Zhang, Zhengwei Jiang · 06 May 2024

When LLMs Meet Cybersecurity: A Systematic Literature Review
Jie Zhang, Haoyu Bu, Hui Wen, Yu Chen, Lun Li, Hongsong Zhu · 06 May 2024

PICLe: Eliciting Diverse Behaviors from Large Language Models with Persona In-Context Learning
Hyeong Kyu Choi, Yixuan Li · 03 May 2024

Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts
Tianyu Zhang, Zixuan Zhao, Jiaqi Huang, Jingyu Hua, Sheng Zhong · 12 Apr 2024

AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
Zeyi Liao, Huan Sun · 11 Apr 2024 · AAML

Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
Bibek Upadhayay, Vahid Behzadan · 09 Apr 2024 · AAML

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
M. Russinovich, Ahmed Salem, Ronen Eldan · 02 Apr 2024

Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs
Shu Yang, Jiayuan Su, Han Jiang, Mengdi Li, Keyuan Cheng, Muhammad Asif Ali, Lijie Hu, Di Wang · 30 Mar 2024

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, ..., Nicolas Flammarion, George J. Pappas, F. Tramèr, Hamed Hassani, Eric Wong · 28 Mar 2024 · ALM, ELM, AAML

Exploring the Deceptive Power of LLM-Generated Fake News: A Study of Real-World Detection Challenges
Yanshen Sun, Jianfeng He, Limeng Cui, Shuo Lei, Chang-Tien Lu · 27 Mar 2024

Exploring the Privacy Protection Capabilities of Chinese Large Language Models
Yuqi Yang, Xiaowen Huang, Jitao Sang · 27 Mar 2024 · ELM, PILM, AILaw

AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue
Yunlong Tang, Daiki Shimada, Jing Bi, Chenliang Xu · 24 Mar 2024 · VGen

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen · 14 Mar 2024

Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
Renjie Pi, Tianyang Han, Wei Xiong, Jipeng Zhang, Runtao Liu, Rui Pan, Tong Zhang · 13 Mar 2024 · MLLM

A Safe Harbor for AI Evaluation and Red Teaming
Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, ..., Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, Peter Henderson · 07 Mar 2024

AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks
Jiacen Xu, Jack W. Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, Zhou Li · 02 Mar 2024

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, ..., Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, Lichao Sun · 27 Feb 2024 · VLM, VGen, EGVM

Eight Methods to Evaluate Robust Unlearning in LLMs
Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell · 26 Feb 2024 · ELM, MU

Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang · 25 Feb 2024 · AAML

How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee · 23 Feb 2024

Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Hansheng Fang, Aishan Liu, Ee-Chien Chang · 21 Feb 2024

Is the System Message Really Important to Jailbreaks in Large Language Models?
Xiaotian Zou, Yongkang Chen, Ke Li · 20 Feb 2024

Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
Christian Schlarmann, Naman D. Singh, Francesco Croce, Matthias Hein · 19 Feb 2024 · VLM, AAML

A StrongREJECT for Empty Jailbreaks
Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, ..., Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer · 15 Feb 2024

Instruction Backdoor Attacks Against Customized LLMs
Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, Yang Zhang · 14 Feb 2024 · AAML, SILM

COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu · 13 Feb 2024 · AAML

Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs
Víctor Gallego · 12 Feb 2024 · SyDa

PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
Wei Zou, Runpeng Geng, Binghui Wang, Jinyuan Jia · 12 Feb 2024 · SILM

Fight Back Against Jailbreaking via Prompt Adversarial Tuning
Yichuan Mo, Yuji Wang, Zeming Wei, Yisen Wang · 09 Feb 2024 · AAML, SILM

Comprehensive Assessment of Jailbreak Attacks Against LLMs
Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, Yang Zhang · 08 Feb 2024 · AAML

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson · 07 Feb 2024 · AAML

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, ..., Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, Dan Hendrycks · 06 Feb 2024 · AAML

GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
Haibo Jin, Ruoxi Chen, Andy Zhou, Yang Zhang, Haohan Wang · 05 Feb 2024 · LLMAG

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy M. Hospedales · 03 Feb 2024 · VLM, MLLM

Building Guardrails for Large Language Models
Yizhen Dong, Ronghui Mu, Gao Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, Xiaowei Huang · 02 Feb 2024 · OffRL

An Early Categorization of Prompt Injection Attacks on Large Language Models
Sippo Rossi, Alisia Marianne Michel, R. Mukkamala, J. Thatcher · 31 Jan 2024 · SILM, AAML

Weak-to-Strong Jailbreaking on Large Language Models
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang · 30 Jan 2024

AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang · 17 Jan 2024 · ELM

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, ..., Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li · 11 Jan 2024

Malla: Demystifying Real-world Large Language Model Integrated Malicious Services
Zilong Lin, Jian Cui, Xiaojing Liao, XiaoFeng Wang · 06 Jan 2024

Exploiting Novel GPT-4 APIs
Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave · 21 Dec 2023 · SILM

Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
Zhuo Zhang, Guangyu Shen, Guanhong Tao, Shuyang Cheng, Xiangyu Zhang · 08 Dec 2023

Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
Matteo Gioele Collu, Tom Janssen-Groesbeek, Stefanos Koffas, Mauro Conti, S. Picek · 06 Dec 2023

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Eric Sun, Yue Zhang · 04 Dec 2023 · PILM, ELM

Negotiating with LLMs: Prompt Hacks, Skill Gaps, and Reasoning Deficits
Johannes Schneider, Steffi Haag, Leona Chandra Kruse · 26 Nov 2023

FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang · 09 Nov 2023 · MLLM

A Survey of Large Language Models in Medicine: Progress, Application, and Challenge
Hongjian Zhou, Fenglin Liu, Boyang Gu, Xinyu Zou, Jinfa Huang, ..., Yefeng Zheng, Lei A. Clifton, Zheng Li, Fenglin Liu, David A. Clifton · 09 Nov 2023 · LM&MA

Can LLMs Follow Simple Rules?
Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David A. Wagner · 06 Nov 2023 · ALM

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
Sam Toyer, Olivia Watkins, Ethan Mendes, Justin Svegliato, Luke Bailey, ..., Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart J. Russell · 02 Nov 2023