Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, Jacob Steinhardt
5 July 2023 · arXiv:2307.02483

Papers citing "Jailbroken: How Does LLM Safety Training Fail?"

Showing 50 of 636 citing papers.

Fake Artificial Intelligence Generated Contents (FAIGC): A Survey of Theories, Detection Methods, and Opportunities
Xiaomin Yu
Yezhaohui Wang
Yanfang Chen
Zhen Tao
Dinghao Xi
Shichao Song
Simin Niu
Zhiyu Li
25 Apr 2024

Leveraging Artificial Intelligence to Promote Awareness in Augmented Reality Systems
Wangfan Li
Rohit Mallick
Carlos Toxtli Hernandez
Christopher Flathmann
Nathan J. McNeese
23 Apr 2024

Graph Machine Learning in the Era of Large Language Models (LLMs)
Wenqi Fan
Shijie Wang
Jiani Huang
Zhikai Chen
Yu Song
...
Haitao Mao
Hui Liu
Xiaorui Liu
Dawei Yin
Qing Li
AI4CE
23 Apr 2024

Protecting Your LLMs with Information Bottleneck
Zichuan Liu
Zefan Wang
Linjie Xu
Jinyu Wang
Lei Song
Tianchun Wang
Chunlin Chen
Wei Cheng
Jiang Bian
KELM
AAML
22 Apr 2024

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
Javier Rando
Francesco Croce
Kryvstof Mitka
Stepan Shabalin
Maksym Andriushchenko
Nicolas Flammarion
F. Tramèr
22 Apr 2024

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus
Arman Zharmagambetov
Chuan Guo
Brandon Amos
Yuandong Tian
AAML
21 Apr 2024

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace
Kai Y. Xiao
R. Leike
Lilian Weng
Johannes Heidecke
Alex Beutel
SILM
19 Apr 2024

Uncovering Safety Risks of Large Language Models through Concept Activation Vector
Zhihao Xu
Ruixuan Huang
Changyu Chen
Shuai Wang
Xiting Wang
LLMSV
18 Apr 2024

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
Yingchaojie Feng
Zhizhang Chen
Zhining Kang
Sijia Wang
Minfeng Zhu
Wei Zhang
Wei Chen
12 Apr 2024

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
Shreyas Chaudhari
Pranjal Aggarwal
Vishvak Murahari
Tanmay Rajpurohit
A. Kalyan
Karthik Narasimhan
A. Deshpande
Bruno Castro da Silva
12 Apr 2024

Manipulating Large Language Models to Increase Product Visibility
Aounon Kumar
Himabindu Lakkaraju
11 Apr 2024

AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
Zeyi Liao
Huan Sun
AAML
11 Apr 2024

Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
Bibek Upadhayay
Vahid Behzadan
AAML
09 Apr 2024

Rethinking How to Evaluate Language Model Jailbreak
Hongyu Cai
Arjun Arunasalam
Leo Y. Lin
Antonio Bianchi
Z. Berkay Celik
ALM
09 Apr 2024

AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
Shaona Ghosh
Prasoon Varshney
Erick Galinkin
Christopher Parisien
ELM
09 Apr 2024

Goal-guided Generative Prompt Injection Attack on Large Language Models
Chong Zhang
Mingyu Jin
Qinkai Yu
Chengzhi Liu
Haochen Xue
Xiaobo Jin
AAML
SILM
06 Apr 2024

Taxonomy and Analysis of Sensitive User Queries in Generative AI Search
Hwiyeol Jo
Taiwoo Park
Nayoung Choi
Changbong Kim
Ohjoon Kwon
...
Kyoungho Shin
Sun Suk Lim
Kyungmi Kim
Jihye Lee
Sun Kim
05 Apr 2024

Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions
Yuting He
Fuxiang Huang
Xinrui Jiang
Yuxiang Nie
Minghao Wang
Jiguang Wang
Hao Chen
LM&MA
AI4CE
04 Apr 2024

Empowering Biomedical Discovery with AI Agents
Shanghua Gao
Ada Fang
Yepeng Huang
Valentina Giunchiglia
Ayush Noori
Jonathan Richard Schwarz
Yasha Ektefaie
Jovana Kondic
Marinka Zitnik
LLMAG
AI4CE
03 Apr 2024

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
M. Russinovich
Ahmed Salem
Ronen Eldan
02 Apr 2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko
Francesco Croce
Nicolas Flammarion
AAML
02 Apr 2024

Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs
Shu Yang
Jiayuan Su
Han Jiang
Mengdi Li
Keyuan Cheng
Muhammad Asif Ali
Lijie Hu
Di Wang
30 Mar 2024

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Patrick Chao
Edoardo Debenedetti
Alexander Robey
Maksym Andriushchenko
Francesco Croce
...
Nicolas Flammarion
George J. Pappas
F. Tramèr
Hamed Hassani
Eric Wong
ALM
ELM
AAML
28 Mar 2024

Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
Yutong He
Alexander Robey
Naoki Murata
Yiding Jiang
J. Williams
George Pappas
Hamed Hassani
Yuki Mitsufuji
Ruslan Salakhutdinov
J. Zico Kolter
DiffM
28 Mar 2024

Exploring the Privacy Protection Capabilities of Chinese Large Language Models
Yuqi Yang
Xiaowen Huang
Jitao Sang
ELM
PILM
AILaw
27 Mar 2024

Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
Zhiyuan Yu
Xiaogeng Liu
Shunning Liang
Zach Cameron
Chaowei Xiao
Ning Zhang
26 Mar 2024

Optimization-based Prompt Injection Attack to LLM-as-a-Judge
Jiawen Shi
Zenghui Yuan
Yinuo Liu
Yue Huang
Pan Zhou
Lichao Sun
Neil Zhenqiang Gong
AAML
26 Mar 2024

Large language models for crowd decision making based on prompt design strategies using ChatGPT: models, analysis and challenges
Cristina Zuheros
David Herrera-Poyatos
Rosana Montes-Soldado
Francisco Herrera
22 Mar 2024

Risk and Response in Large Language Models: Evaluating Key Threat Categories
Bahareh Harandizadeh
A. Salinas
Fred Morstatter
22 Mar 2024

Detoxifying Large Language Models via Knowledge Editing
Meng Wang
Ningyu Zhang
Ziwen Xu
Zekun Xi
Shumin Deng
Yunzhi Yao
Qishen Zhang
Linyi Yang
Jindong Wang
Huajun Chen
KELM
21 Mar 2024

From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards
Khaoula Chehbouni
Megha Roshan
Emmanuel Ma
Futian Andrew Wei
Afaf Taik
Jackie CK Cheung
G. Farnadi
20 Mar 2024

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models
Weikang Zhou
Xiao Wang
Limao Xiong
Han Xia
Yingshuang Gu
...
Lijun Li
Jing Shao
Tao Gui
Qi Zhang
Xuanjing Huang
18 Mar 2024

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li
Hangyu Guo
Kun Zhou
Wayne Xin Zhao
Ji-Rong Wen
14 Mar 2024

Tastle: Distract Large Language Models for Automatic Jailbreak Attack
Zeguan Xiao
Yan Yang
Guanhua Chen
Yun-Nung Chen
AAML
13 Mar 2024

Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
Egor Zverev
Sahar Abdelnabi
Soroush Tabesh
Mario Fritz
Christoph H. Lampert
11 Mar 2024

Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Stephen Casper
Lennart Schulze
Oam Patel
Dylan Hadfield-Menell
AAML
08 Mar 2024

Automatic and Universal Prompt Injection Attacks against Large Language Models
Xiaogeng Liu
Zhiyuan Yu
Yizhe Zhang
Ning Zhang
Chaowei Xiao
SILM
AAML
07 Mar 2024

A Safe Harbor for AI Evaluation and Red Teaming
Shayne Longpre
Sayash Kapoor
Kevin Klyman
Ashwin Ramaswami
Rishi Bommasani
...
Daniel Kang
Sandy Pentland
Arvind Narayanan
Percy Liang
Peter Henderson
07 Mar 2024

Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
Dario Pasquini
Martin Strohmeier
Carmela Troncoso
AAML
06 Mar 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li
Alexander Pan
Anjali Gopal
Summer Yue
Daniel Berrios
...
Yan Shoshitaishvili
Jimmy Ba
K. Esvelt
Alexandr Wang
Dan Hendrycks
ELM
05 Mar 2024

Enhancing LLM Safety via Constrained Direct Preference Optimization
Zixuan Liu
Xiaolin Sun
Zizhan Zheng
04 Mar 2024

Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
Arijit Ghosh Chowdhury
Md. Mofijul Islam
Vaibhav Kumar
F. H. Shezan
Vaibhav Kumar
Vinija Jain
Aman Chadha
AAML
PILM
03 Mar 2024

AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
Yifan Zeng
Yiran Wu
Xiao Zhang
Huazheng Wang
Qingyun Wu
LLMAG
AAML
02 Mar 2024

AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks
Jiacen Xu
Jack W. Stokes
Geoff McDonald
Xuesong Bai
David Marshall
Siyue Wang
Adith Swaminathan
Zhou Li
02 Mar 2024

Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
Xiaomeng Hu
Pin-Yu Chen
Tsung-Yi Ho
AAML
01 Mar 2024

Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment
Yiju Guo
Ganqu Cui
Lifan Yuan
Ning Ding
Jiexin Wang
...
Ruobing Xie
Jie Zhou
Yankai Lin
Zhiyuan Liu
Maosong Sun
29 Feb 2024

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction
Tong Liu
Yingjie Zhang
Zhe Zhao
Yinpeng Dong
Guozhu Meng
Kai Chen
AAML
28 Feb 2024

Adversarial Math Word Problem Generation
Roy Xie
Chengxuan Huang
Junlin Wang
Bhuwan Dhingra
AAML
27 Feb 2024

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
Zhenting Qi
Hanlin Zhang
Eric Xing
Sham Kakade
Hima Lakkaraju
SILM
27 Feb 2024

Securing Reliability: A Brief Overview on Enhancing In-Context Learning for Foundation Models
Yunpeng Huang
Yaonan Gu
Jingwei Xu
Zhihong Zhu
Zhaorun Chen
Xiaoxing Ma
27 Feb 2024