Jailbroken: How Does LLM Safety Training Fail?

5 July 2023
Alexander Wei
Nika Haghtalab
Jacob Steinhardt

Papers citing "Jailbroken: How Does LLM Safety Training Fail?"

50 / 634 papers shown
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai
Yuchen Zhang
Shichang Zhang
Fan Yin
Difan Zou
Yisong Yue
Ziniu Hu
21
0
0
04 Jun 2024
Safeguarding Large Language Models: A Survey
Yi Dong
Ronghui Mu
Yanghao Zhang
Siqi Sun
Tianle Zhang
...
Yi Qi
Jinwei Hu
Jie Meng
Saddek Bensalem
Xiaowei Huang
OffRL
KELM
AILaw
35
17
0
03 Jun 2024
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
Xiaosen Zheng
Tianyu Pang
Chao Du
Qian Liu
Jing Jiang
Min-Bin Lin
AAML
60
29
0
03 Jun 2024
BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models
Jiaqi Xue
Meng Zheng
Yebowen Hu
Fei Liu
Xun Chen
Qian Lou
AAML
SILM
21
25
0
03 Jun 2024
LIDAO: Towards Limited Interventions for Debiasing (Large) Language Models
Tianci Liu
Haoyu Wang
Shiyang Wang
Yu Cheng
Jing Gao
ALM
30
0
0
01 Jun 2024
Exploring Vulnerabilities and Protections in Large Language Models: A Survey
Frank Weizhen Liu
Chenhui Hu
AAML
37
7
0
01 Jun 2024
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Xiaojun Jia
Tianyu Pang
Chao Du
Yihao Huang
Jindong Gu
Yang Liu
Xiaochun Cao
Min-Bin Lin
AAML
44
22
0
31 May 2024
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui
Wei-Lin Chiang
Ion Stoica
Cho-Jui Hsieh
ALM
38
33
0
31 May 2024
DORY: Deliberative Prompt Recovery for LLM
Lirong Gao
Ru Peng
Yiming Zhang
Junbo Zhao
32
3
0
31 May 2024
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
Jiahao Yu
Haozheng Luo
Jerry Yao-Chieh Hu
Wenbo Guo
Han Liu
Xinyu Xing
33
18
0
31 May 2024
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Haibo Jin
Andy Zhou
Joe D. Menke
Haohan Wang
38
11
0
30 May 2024
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
Chen Xiong
Xiangyu Qi
Pin-Yu Chen
Tsung-Yi Ho
AAML
32
18
0
30 May 2024
Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language Models
Hao-Ran Cheng
Erjia Xiao
Jiahang Cao
Le Yang
Kaidi Xu
Jindong Gu
Renjing Xu
AAML
55
7
0
30 May 2024
PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations
Jiatong Li
Renjun Hu
Kunzhe Huang
Zhuang Yan
Qi Liu
Mengxiao Zhu
Xing Shi
Wei Lin
KELM
46
4
0
30 May 2024
AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization
Jiawei Chen
Xiao Yang
Zhengwei Fang
Yu Tian
Yinpeng Dong
Zhaoxia Yin
Hang Su
24
1
0
30 May 2024
AI Risk Management Should Incorporate Both Safety and Security
Xiangyu Qi
Yangsibo Huang
Yi Zeng
Edoardo Debenedetti
Jonas Geiping
...
Chaowei Xiao
Bo-wen Li
Dawn Song
Peter Henderson
Prateek Mittal
AAML
43
10
0
29 May 2024
Voice Jailbreak Attacks Against GPT-4o
Xinyue Shen
Yixin Wu
Michael Backes
Yang Zhang
AuLLM
34
9
0
29 May 2024
Toxicity Detection for Free
Zhanhao Hu
Julien Piet
Geng Zhao
Jiantao Jiao
David A. Wagner
27
3
0
29 May 2024
Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang
Sihao Hu
Fatih Ilhan
Selim Furkan Tekin
Ling Liu
42
23
0
28 May 2024
Exploiting LLM Quantization
Kazuki Egashira
Mark Vero
Robin Staab
Jingxuan He
Martin Vechev
MQ
19
11
0
28 May 2024
Improved Generation of Adversarial Examples Against Safety-aligned LLMs
Qizhang Li
Yiwen Guo
Wangmeng Zuo
Hao Chen
AAML
SILM
21
5
0
28 May 2024
Learning diverse attacks on large language models for robust red-teaming and safety tuning
Seanie Lee
Minsu Kim
Lynn Cherif
David Dobre
Juho Lee
...
Kenji Kawaguchi
Gauthier Gidel
Yoshua Bengio
Nikolay Malkin
Moksh Jain
AAML
53
12
0
28 May 2024
Cross-Modal Safety Alignment: Is textual unlearning all you need?
Trishna Chakraborty
Erfan Shayegani
Zikui Cai
Nael B. Abu-Ghazaleh
M. Salman Asif
Yue Dong
A. Roy-Chowdhury
Chengyu Song
39
15
0
27 May 2024
ReMoDetect: Reward Models Recognize Aligned LLM's Generations
Hyunseok Lee
Jihoon Tack
Jinwoo Shin
DeLMO
36
0
0
27 May 2024
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
Sheng-Hsuan Peng
Pin-Yu Chen
Matthew Hull
Duen Horng Chau
50
20
0
27 May 2024
Exploiting the Layered Intrinsic Dimensionality of Deep Models for Practical Adversarial Training
Enes Altinisik
Safa Messaoud
H. Sencar
Hassan Sajjad
Sanjay Chawla
AAML
43
0
0
27 May 2024
Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
Chia-Yi Hsu
Yu-Lin Tsai
Chih-Hsun Lin
Pin-Yu Chen
Chia-Mu Yu
Chun-ying Huang
44
32
0
27 May 2024
The Uncanny Valley: Exploring Adversarial Robustness from a Flatness Perspective
Nils Philipp Walter
Linara Adilova
Jilles Vreeken
Michael Kamp
AAML
43
2
0
27 May 2024
Automatic Jailbreaking of the Text-to-Image Generative AI Systems
Minseon Kim
Hyomin Lee
Boqing Gong
Huishuai Zhang
Sung Ju Hwang
27
11
0
26 May 2024
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Chak Tou Leong
Yi Cheng
Kaishuai Xu
Jian Wang
Hanlin Wang
Wenjie Li
AAML
43
17
0
25 May 2024
Sparse maximal update parameterization: A holistic approach to sparse training dynamics
Nolan Dey
Shane Bergsma
Joel Hestness
19
5
0
24 May 2024
Robustifying Safety-Aligned Large Language Models through Clean Data Curation
Xiaoqun Liu
Jiacheng Liang
Muchao Ye
Zhaohan Xi
AAML
45
17
0
24 May 2024
Impact of Non-Standard Unicode Characters on Security and Comprehension in Large Language Models
Johan S Daniel
Anand Pal
30
0
0
23 May 2024
MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability
Yanrui Du
Sendong Zhao
Danyang Zhao
Ming Ma
Yuhan Chen
Liangyu Huo
Qing Yang
Dongliang Xu
Bing Qin
24
5
0
23 May 2024
Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal
Apratim De
Yiting He
Yiqiao Zhong
Junjie Hu
29
7
0
22 May 2024
GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
Govind Ramesh
Yao Dou
Wei-ping Xu
PILM
32
10
0
21 May 2024
Hummer: Towards Limited Competitive Preference Dataset
Li Jiang
Yusen Wu
Junwu Xiong
Jingqing Ruan
Yichuan Ding
Qingpei Guo
Zujie Wen
Jun Zhou
Xiaotie Deng
29
6
0
19 May 2024
Human-AI Safety: A Descendant of Generative AI and Control Systems Safety
Andrea V. Bajcsy
J. F. Fisac
32
7
0
16 May 2024
What is it for a Machine Learning Model to Have a Capability?
Jacqueline Harding
Nathaniel Sharadin
ELM
31
3
0
14 May 2024
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
Raghuveer Peri
Sai Muralidhar Jayanthi
S. Ronanki
Anshu Bhatia
Karel Mundnich
...
Srikanth Vishnubhotla
Daniel Garcia-Romero
S. Srinivasan
Kyu J. Han
Katrin Kirchhoff
AAML
32
3
0
14 May 2024
PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition
Ziyang Zhang
Qizhen Zhang
Jakob N. Foerster
AAML
32
18
0
13 May 2024
Mitigating Exaggerated Safety in Large Language Models
Ruchi Bhalani
Ruchira Ray
21
1
0
08 May 2024
Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks
Georgios Pantazopoulos
Amit Parekh
Malvina Nikandrou
Alessandro Suglia
32
5
0
07 May 2024
How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability
Jorge García-Carrasco
Alejandro Maté
Juan Trujillo
22
8
0
07 May 2024
Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
Shang Shang
Xinqiang Zhao
Zhongjiang Yao
Yepeng Yao
Liya Su
Zijing Fan
Xiaodan Zhang
Zhengwei Jiang
55
3
0
06 May 2024
Testing and Understanding Erroneous Planning in LLM Agents through Synthesized User Inputs
Zhenlan Ji
Daoyuan Wu
Pingchuan Ma
Zongjie Li
Shuai Wang
LLMAG
40
3
0
27 Apr 2024
Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs
Valeriia Cherepanova
James Zou
AAML
28
4
0
26 Apr 2024
Don't Say No: Jailbreaking LLM by Suppressing Refusal
Yukai Zhou
Wenjie Wang
AAML
34
15
0
25 Apr 2024
Fake Artificial Intelligence Generated Contents (FAIGC): A Survey of Theories, Detection Methods, and Opportunities
Xiaomin Yu
Yezhaohui Wang
Yanfang Chen
Zhen Tao
Dinghao Xi
Shichao Song
Simin Niu
Zhiyu Li
62
7
0
25 Apr 2024
Leveraging Artificial Intelligence to Promote Awareness in Augmented Reality Systems
Wangfan Li
Rohit Mallick
Carlos Toxtli Hernandez
Christopher Flathmann
Nathan J. McNeese
20
0
0
23 Apr 2024