Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, Jacob Steinhardt · 5 July 2023 · arXiv:2307.02483
Papers citing "Jailbroken: How Does LLM Safety Training Fail?"

50 / 634 papers shown

Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han · 09 Oct 2024

Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Xinyi Zeng, Yuying Shang, Yutao Zhu, Jingyuan Zhang, Yu Tian · AAML · 09 Oct 2024

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min-Bin Lin · 09 Oct 2024

Non-Halting Queries: Exploiting Fixed Points in LLMs
Ghaith Hammouri, Kemal Derya, B. Sunar · 08 Oct 2024

Superficial Safety Alignment Hypothesis
Jianwei Li, Jung-Eun Kim · 07 Oct 2024

Collaboration! Towards Robust Neural Methods for Routing Problems
Jianan Zhou, Yaoxin Wu, Zhiguang Cao, Wen Song, Jie Zhang, Zhiqi Shen · AAML · 07 Oct 2024

Alignment Between the Decision-Making Logic of LLMs and Human Cognition: A Case Study on Legal LLMs
Lu Chen, Yuxuan Huang, Yixing Li, Yaohui Jin, Shuai Zhao, Zilong Zheng, Quanshi Zhang · 06 Oct 2024

Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models
Yiting Dong, Guobin Shen, Dongcheng Zhao, Xiang-Yu He, Yi Zeng · 05 Oct 2024

ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs
Lu Yan, Siyuan Cheng, Xuan Chen, Kaiyuan Zhang, Guangyu Shen, Zhuo Zhang, Xiangyu Zhang · AAML, SILM · 05 Oct 2024

Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step
Wenxuan Wang, Kuiyi Gao, Zihan Jia, Youliang Yuan, Jen-tse Huang, Qiuzhi Liu, Shuai Wang, Wenxiang Jiao, Zhaopeng Tu · 04 Oct 2024

You Know What I'm Saying: Jailbreak Attack via Implicit Reference
Tianyu Wu, Lingrui Mei, Ruibin Yuan, Lujun Li, Wei Xue, Yike Guo · 04 Oct 2024

Gradient-based Jailbreak Images for Multimodal Fusion Models
Javier Rando, Hannah Korevaar, Erik Brinkman, Ivan Evtimov, Florian Tramèr · AAML · 04 Oct 2024

Can Watermarked LLMs be Identified by Users via Crafted Prompts?
Aiwei Liu, Sheng Guan, Y. Liu, L. Pan, Yifei Zhang, Liancheng Fang, Lijie Wen, Philip S. Yu, Xuming Hu · WaLM · 04 Oct 2024

HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Ruibin Yuan, Xueqi Cheng · 03 Oct 2024

AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao · 03 Oct 2024

Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
Guobin Shen, Dongcheng Zhao, Yiting Dong, Xiang-Yu He, Yi Zeng · AAML · 03 Oct 2024

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
Maya Pavlova, Erik Brinkman, Krithika Iyer, Vítor Albiero, Joanna Bitton, Hailey Nguyen, J. Li, Cristian Canton Ferrer, Ivan Evtimov, Aaron Grattafiori · ALM · 02 Oct 2024

FlipAttack: Jailbreak LLMs via Flipping
Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, Bryan Hooi · AAML · 02 Oct 2024

Endless Jailbreaks with Bijection Learning
Brian R. Y. Huang, Maximilian Li, Leonard Tang · AAML · 02 Oct 2024

HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Seanie Lee, Haebin Seong, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang · 02 Oct 2024

VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data
Xuefeng Du, Reshmi Ghosh, Robert Sim, Ahmed Salem, Vitor Carvalho, Emily Lawton, Yixuan Li, Jack W. Stokes · VLM, AAML · 01 Oct 2024

GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks
Rongchang Li, Minjie Chen, Chang Hu, Han Chen, Wenpeng Xing, Meng Han · SILM, ELM · 29 Sep 2024

Multimodal Pragmatic Jailbreak on Text-to-image Models
Tong Liu, Zhixin Lai, Gengyuan Zhang, Philip H. S. Torr, Vera Demberg, Volker Tresp, Jindong Gu · 27 Sep 2024

Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models
Jiaming Li, Lei Zhang, Yunshui Li, Ziqiang Liu, Yuelin Bai, Run Luo, Longze Chen, Min Yang · ALM · 27 Sep 2024

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu · AAML · 26 Sep 2024

RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
Yifan Jiang, Kriti Aggarwal, Tanmay Laud, Kashif Munir, Jay Pujara, Subhabrata Mukherjee · AAML · 26 Sep 2024

An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, F. Tramèr, Javier Rando · MU, AAML · 26 Sep 2024

Data-Centric AI Governance: Addressing the Limitations of Model-Focused Policies
Ritwik Gupta, Leah Walker, Rodolfo Corona, Stephanie Fu, Suzanne Petryk, Janet Napolitano, Trevor Darrell, Andrew W. Reddie · ELM · 25 Sep 2024

LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ
Marc-Antoine Allard, Matin Ansaripour, Maria Yuffa, Paul Teiletche · LRM · 25 Sep 2024

RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code
Jiachi Chen, Qingyuan Zhong, Yanlin Wang, Kaiwen Ning, Yongkun Liu, Zenan Xu, Zhe Zhao, Ting Chen, Zibin Zheng · AAML · 23 Sep 2024

Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI
Ambrish Rawat, Stefan Schoepf, Giulio Zizzo, Giandomenico Cornacchia, Muhammad Zaid Hameed, ..., Elizabeth M. Daly, Mark Purcell, P. Sattigeri, Pin-Yu Chen, Kush R. Varshney · AAML · 23 Sep 2024

PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs
Jiahao Yu, Yangguang Shao, Hanwen Miao, Junzheng Shi · SILM, AAML · 23 Sep 2024

Backtracking Improves Generation Safety
Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason Weston, Eric Michael Smith · SILM · 22 Sep 2024

The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends
Xinghua Zhang, Haiyang Yu, Yongbin Li, Minzheng Wang, Longze Chen, Fei Huang · 21 Sep 2024

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach
Zhihao Lin, Wei Ma, Mingyi Zhou, Yanjie Zhao, Haoyu Wang, Yang Liu, Jun Wang, Li Li · AAML · 21 Sep 2024

Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models
Hao Cheng, Erjia Xiao, Chengyuan Yu, Zhao Yao, Jiahang Cao, ..., Jiaxu Wang, Mengshu Sun, Kaidi Xu, Jindong Gu, Renjing Xu · AAML · 20 Sep 2024

Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning
Essa Jan, Nouar Aldahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, Yasir Zaki · 18 Sep 2024

LLM-Agent-UMF: LLM-based Agent Unified Modeling Framework for Seamless Integration of Multi Active/Passive Core-Agents
Amine B. Hassouna, Hana Chaari, Ines Belhaj · LLMAG · 17 Sep 2024

Jailbreaking Large Language Models with Symbolic Mathematics
Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, S. Jha, Peyman Najafirad · AAML · 17 Sep 2024

Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking
Stav Cohen, Ron Bitton, Ben Nassi · 12 Sep 2024

Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
Md Zarif Hossain, Ahmed Imteaj · AAML, VLM · 11 Sep 2024

Exploring Straightforward Conversational Red-Teaming
George Kour, Naama Zwerdling, Marcel Zalmanovici, Ateret Anaby-Tavor, Ora Nova Fandina, E. Farchi · AAML · 07 Sep 2024

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang · PILM, AAML · 05 Sep 2024

Towards a Unified View of Preference Learning for Large Language Models: A Survey
Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai, Z. Yang, ..., Houfeng Wang, Zhifang Sui, Peiyi Wang, Baobao Chang · 04 Sep 2024

LLM-GAN: Construct Generative Adversarial Network Through Large Language Models For Explainable Fake News Detection
Yifeng Wang, Zhouhong Gu, Siwei Zhang, Suhang Zheng, Tao Wang, Tianyu Li, Hongwei Feng, Yanghua Xiao · 03 Sep 2024

Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks
Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Jason Zhang, Julius Broomfield, Sara Pieri, Reihaneh Iranmanesh, Reihaneh Rabbany, Kellin Pelrine · AAML · 29 Aug 2024

Acceptable Use Policies for Foundation Models
Kevin Klyman · 29 Aug 2024

FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench
Aman Priyanshu, Supriti Vijay · AAML · 28 Aug 2024

Legilimens: Practical and Unified Content Moderation for Large Language Model Services
Jialin Wu, Jiangyi Deng, Shengyuan Pang, Yanjiao Chen, Jiayang Xu, Xinfeng Li, Wenyuan Xu · 28 Aug 2024

LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue · AAML, MU · 27 Aug 2024