Jailbroken: How Does LLM Safety Training Fail? (arXiv:2307.02483)
5 July 2023
Alexander Wei, Nika Haghtalab, Jacob Steinhardt

Papers citing "Jailbroken: How Does LLM Safety Training Fail?"
50 / 634 papers shown
Flames: Benchmarking Value Alignment of LLMs in Chinese (12 Nov 2023) [ALM]
  Kexin Huang, Xiangyang Liu, Qianyu Guo, Tianxiang Sun, Jiawei Sun, ..., Yixu Wang, Yan Teng, Xipeng Qiu, Yingchun Wang, Dahua Lin
Fake Alignment: Are LLMs Really Aligned Well? (10 Nov 2023)
  Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu-Gang Jiang, Yu Qiao, Yingchun Wang
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts (09 Nov 2023) [MLLM]
  Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang
A Survey of Large Language Models in Medicine: Progress, Application, and Challenge (09 Nov 2023) [LM&MA]
  Hongjian Zhou, Fenglin Liu, Boyang Gu, Xinyu Zou, Jinfa Huang, ..., Yefeng Zheng, Lei A. Clifton, Zheng Li, Fenglin Liu, David A. Clifton
Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?" (08 Nov 2023) [AAML]
  C. D. Freeman, Laura J. Culp, Aaron T Parisi, Maxwell Bileschi, Gamaleldin F. Elsayed, ..., Peter J. Liu, Roman Novak, Yundi Qian, Noah Fiedel, Jascha Narain Sohl-Dickstein
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (06 Nov 2023)
  Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando
DeepInception: Hypnotize Large Language Model to Be Jailbreaker (06 Nov 2023)
  Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han
Can LLMs Follow Simple Rules? (06 Nov 2023) [ALM]
  Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David A. Wagner
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game (02 Nov 2023)
  Sam Toyer, Olivia Watkins, Ethan Mendes, Justin Svegliato, Luke Bailey, ..., Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart J. Russell
The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback (31 Oct 2023) [ALM]
  Nathan Lambert, Roberto Calandra
RAIFLE: Reconstruction Attacks on Interaction-based Federated Learning with Adversarial Data Manipulation (29 Oct 2023)
  Dzung Pham, Shreyas Kulkarni, Amir Houmansadr
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition (24 Oct 2023) [SILM]
  Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher Carnahan, Jordan L. Boyd-Graber
Self-Guard: Empower the LLM to Safeguard Itself (24 Oct 2023)
  Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, Kam-Fai Wong
The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risks (24 Oct 2023) [AAML, PILM]
  Xiaoyi Chen, Siyuan Tang, Rui Zhu, Shijun Yan, Lei Jin, Zihao Wang, Liya Su, Zhikun Zhang, XiaoFeng Wang, Haixu Tang
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models (23 Oct 2023) [SILM, AAML]
  Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, A. Nenkova, Tong Sun
Formalizing and Benchmarking Prompt Injection Attacks and Defenses (19 Oct 2023) [SILM, LLMAG]
  Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, Neil Zhenqiang Gong
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting (17 Oct 2023)
  Melanie Sclar, Yejin Choi, Yulia Tsvetkov, Alane Suhr
NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails (16 Oct 2023) [KELM]
  Traian Rebedea, R. Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, Jonathan Cohen
Privacy in Large Language Models: Attacks, Defenses and Future Directions (16 Oct 2023) [PILM]
  Haoran Li, Yulin Chen, Jinglong Luo, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Yangqiu Song
Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks (16 Oct 2023)
  Shuyu Jiang, Xingshu Chen, Rui Tang
Is Certifying $\ell_p$ Robustness Still Worthwhile? (13 Oct 2023) [AAML, OOD]
  Ravi Mangal, Klas Leino, Zifan Wang, Kai Hu, Weicheng Yu, Corina S. Pasareanu, Anupam Datta, Matt Fredrikson
Jailbreaking Black Box Large Language Models in Twenty Queries (12 Oct 2023) [AAML]
  Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation (10 Oct 2023) [AAML]
  Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen
Multilingual Jailbreak Challenges in Large Language Models (10 Oct 2023) [AAML]
  Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (10 Oct 2023)
  Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! (05 Oct 2023) [SILM]
  Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks (05 Oct 2023) [AAML]
  Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas
Misusing Tools in Large Language Models With Visual Adversarial Examples (04 Oct 2023) [AAML]
  Xiaohan Fu, Zihan Wang, Shuheng Li, Rajesh K. Gupta, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Earlence Fernandes
Low-Resource Languages Jailbreak GPT-4 (03 Oct 2023) [SILM]
  Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach
Jailbreaker in Jail: Moving Target Defense for Large Language Models (03 Oct 2023) [AAML]
  Bocheng Chen, Advait Paliwal, Qiben Yan
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (03 Oct 2023) [SILM]
  Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao
Can Language Models be Instructed to Protect Personal Information? (03 Oct 2023) [PILM]
  Yang Chen, Ethan Mendes, Sauvik Das, Wei-ping Xu, Alan Ritter
Large Language Models Cannot Self-Correct Reasoning Yet (03 Oct 2023) [ReLM, LRM]
  Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, Denny Zhou
On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused? (02 Oct 2023)
  Hangfan Zhang, Zhimeng Guo, Huaisheng Zhu, Bochuan Cao, Lu Lin, Jinyuan Jia, Jinghui Chen, Di Wu
LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples (02 Oct 2023) [HILM, LRM, AAML]
  Jia-Yu Yao, Kun-Peng Ning, Zhen-Hui Liu, Munan Ning, Li Yuan
LoRA ensembles for large language model fine-tuning (29 Sep 2023) [UQCV]
  Xi Wang, Laurence Aitchison, Maja Rudolph
Warfare: Breaking the Watermark Protection of AI-Generated Content (27 Sep 2023) [WIGM]
  Guanlin Li, Yifei Chen, Jie M. Zhang, Shangwei Guo, Jiwei Li, Tianwei Zhang
Can LLM-Generated Misinformation Be Detected? (25 Sep 2023) [DeLMO]
  Canyu Chen, Kai Shu
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset (21 Sep 2023)
  Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, ..., Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Haotong Zhang
How Robust is Google's Bard to Adversarial Image Attacks? (21 Sep 2023) [AAML]
  Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, X. Yang, Yichi Zhang, Yu Tian, Hang Su, Jun Zhu
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (19 Sep 2023) [SILM]
  Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing
Understanding Catastrophic Forgetting in Language Models via Implicit Inference (18 Sep 2023) [CLL]
  Suhas Kotha, Jacob Mitchell Springer, Aditi Raghunathan
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM (18 Sep 2023) [AAML]
  Bochuan Cao, Yu Cao, Lu Lin, Jinghui Chen
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions (14 Sep 2023) [ALM, LM&MA, LRM]
  Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Y. Zou
FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models (11 Sep 2023)
  Dongyu Yao, Jianshu Zhang, Ian G. Harris, Marcel Carlsson
Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models (08 Sep 2023)
  Arka Dutta, Adel Khorramrouz, Sujan Dutta, Ashiqur R. KhudaBukhsh
Certifying LLM Safety against Adversarial Prompting (06 Sep 2023) [AAML]
  Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, S. Feizi, Himabindu Lakkaraju
Demystifying RCE Vulnerabilities in LLM-Integrated Apps (06 Sep 2023) [SILM]
  Tong Liu, Zizhuang Deng, Guozhu Meng, Yuekang Li, Kai Chen
Open Sesame! Universal Black Box Jailbreaking of Large Language Models (04 Sep 2023) [AAML]
  Raz Lapid, Ron Langberg, Moshe Sipper
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models (03 Sep 2023) [RALM, LRM, HILM]
  Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, ..., Longyue Wang, A. Luu, Wei Bi, Freda Shi, Shuming Shi