Jailbroken: How Does LLM Safety Training Fail?

5 July 2023
Alexander Wei
Nika Haghtalab
Jacob Steinhardt

Papers citing "Jailbroken: How Does LLM Safety Training Fail?"

50 / 634 papers shown
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
Hongfu Liu
Yuxi Xie
Ye Wang
Michael Shieh
41
2
0
27 Aug 2024
LLM-PBE: Assessing Data Privacy in Large Language Models
Qinbin Li
Junyuan Hong
Chulin Xie
Jeffrey Tan
Rachel Xin
...
Dan Hendrycks
Zhangyang Wang
Bo Li
Bingsheng He
Dawn Song
ELM
PILM
36
12
0
23 Aug 2024
Towards "Differential AI Psychology" and in-context Value-driven
  Statement Alignment with Moral Foundations Theory
Towards "Differential AI Psychology" and in-context Value-driven Statement Alignment with Moral Foundations Theory
Simon Münker
SyDa
19
0
0
21 Aug 2024
EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models
Chongwen Zhao
Zhihao Dou
Kaizhu Huang
AAML
27
0
0
21 Aug 2024
MEGen: Generative Backdoor in Large Language Models via Model Editing
Jiyang Qiu
Xinbei Ma
Zhuosheng Zhang
Hai Zhao
AAML
KELM
SILM
23
3
0
20 Aug 2024
Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions
Jinxin Liu
Zao Yang
23
1
0
20 Aug 2024
Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
Kexin Chen
Yi Liu
Dongxia Wang
Jiaying Chen
Wenhai Wang
44
1
0
18 Aug 2024
MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models
Fenghua Weng
Yue Xu
Chengyan Fu
Wenjie Wang
AAML
35
1
0
16 Aug 2024
Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search
Robert J. Moss
AAML
21
0
0
11 Aug 2024
Can a Bayesian Oracle Prevent Harm from an Agent?
Yoshua Bengio
Michael K. Cohen
Nikolay Malkin
Matt MacDermott
Damiano Fornasiere
Pietro Greiner
Younesse Kaddar
37
4
0
09 Aug 2024
Unlocking Decoding-time Controllability: Gradient-Free Multi-Objective Alignment with Contrastive Prompts
Tingchen Fu
Yupeng Hou
Julian McAuley
Rui Yan
28
3
0
09 Aug 2024
Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models
Fabio Pernisi
Dirk Hovy
Paul Röttger
44
0
0
08 Aug 2024
Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles
Xiongtao Sun
Deyue Zhang
Dongdong Yang
Quanchen Zou
Hui Li
AAML
29
11
0
08 Aug 2024
EnJa: Ensemble Jailbreak on Large Language Models
Jiahao Zhang
Zilong Wang
Ruofan Wang
Xingjun Ma
Yu-Gang Jiang
AAML
16
1
0
07 Aug 2024
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Jingtong Su
Mingyu Lee
SangKeun Lee
33
7
0
02 Aug 2024
Misinforming LLMs: vulnerabilities, challenges and opportunities
Jaroslaw Kornowicz
Daniel Geissler
Kirsten Thommes
25
2
0
02 Aug 2024
Autonomous LLM-Enhanced Adversarial Attack for Text-to-Motion
Honglei Miao
Fan Ma
Ruijie Quan
Kun Zhan
Yi Yang
AAML
34
0
0
01 Aug 2024
Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa
Bhrugu Bharathi
Long Phan
Andy Zhou
Alice Gatti
...
Andy Zou
Dawn Song
Bo Li
Dan Hendrycks
Mantas Mazeika
AAML
MU
47
37
0
01 Aug 2024
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Richard Ren
Steven Basart
Adam Khoja
Alice Gatti
Long Phan
...
Alexander Pan
Gabriel Mukobi
Ryan H. Kim
Stephen Fitz
Dan Hendrycks
ELM
26
19
0
31 Jul 2024
Machine Unlearning in Generative AI: A Survey
Zheyuan Liu
Guangyao Dou
Zhaoxuan Tan
Yijun Tian
Meng-Long Jiang
MU
31
13
0
30 Jul 2024
LocalValueBench: A Collaboratively Built and Extensible Benchmark for Evaluating Localized Value Alignment and Ethical Safety in Large Language Models
Achintya Gopal
Nicholas Wai Long Lau
Eva Adelina Susanto
Chi Lok Yu
Aditya Paul
ELM
19
7
0
27 Jul 2024
GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy
Jan Batzner
Volker Stocker
Stefan Schmid
Gjergji Kasneci
20
1
0
25 Jul 2024
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
Zihui Wu
Haichang Gao
Jianping He
Ping Wang
19
6
0
25 Jul 2024
Can Large Language Models Automatically Jailbreak GPT-4V?
Yuanwei Wu
Yue Huang
Yixin Liu
Xiang Li
Pan Zhou
Lichao Sun
SILM
30
1
0
23 Jul 2024
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
Huiyu Xu
Wenhui Zhang
Zhibo Wang
Feng Xiao
Rui Zheng
Yunhe Feng
Zhongjie Ba
Kui Ren
AAML
LLMAG
29
11
0
23 Jul 2024
Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu
Yishuo Cai
Z. Zhou
Renjie Gu
Haiqin Weng
Yan Liu
Tianwei Zhang
Wei Xu
Han Qiu
29
4
0
23 Jul 2024
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
Shi Lin
Rongchang Li
Xun Wang
Changting Lin
Wenpeng Xing
Meng Han
53
3
0
23 Jul 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang
Yunzhi Yao
Ziwen Xu
Shuofei Qiao
Shumin Deng
...
Yong-jia Jiang
Pengjun Xie
Fei Huang
Huajun Chen
Ningyu Zhang
47
27
0
22 Jul 2024
LLMmap: Fingerprinting For Large Language Models
Dario Pasquini
Evgenios M. Kornaropoulos
G. Ateniese
50
6
0
22 Jul 2024
Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis
Guang-Da Liu
Haitao Mao
Jiliang Tang
K. Johnson
LRM
24
7
0
21 Jul 2024
Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models
Md Zarif Hossain
Ahmed Imteaj
VLM
AAML
29
4
0
20 Jul 2024
Analyzing the Generalization and Reliability of Steering Vectors
Daniel Tan
David Chanin
Aengus Lynch
Dimitrios Kanoulas
Brooks Paige
Adrià Garriga-Alonso
Robert Kirk
LLMSV
84
16
0
17 Jul 2024
The Better Angels of Machine Personality: How Personality Relates to LLM Safety
Jie M. Zhang
Dongrui Liu
Chao Qian
Ziyue Gan
Yong-jin Liu
Yu Qiao
Jing Shao
LLMAG
PILM
40
12
0
17 Jul 2024
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko
Nicolas Flammarion
42
27
0
16 Jul 2024
BadRobot: Jailbreaking Embodied LLMs in the Physical World
Hangtao Zhang
Chenyu Zhu
Xianlong Wang
Ziqi Zhou
Yichen Wang
...
Shengshan Hu
Leo Yu Zhang
Aishan Liu
Peijin Guo
LM&Ro
40
6
0
16 Jul 2024
Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models
Qingcheng Zeng
Mingyu Jin
Qinkai Yu
Zhenting Wang
Wenyue Hua
...
Felix Juefei Xu
Kaize Ding
Fan Yang
Ruixiang Tang
Yongfeng Zhang
AAML
31
10
0
15 Jul 2024
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan
Wenxiang Jiao
Wenxuan Wang
Jen-tse Huang
Jiahao Xu
Tian Liang
Pinjia He
Zhaopeng Tu
35
19
0
12 Jul 2024
ProxyGPT: Enabling Anonymous Queries in AI Chatbots with (Un)Trustworthy Browser Proxies
Dzung Pham
J. Sheffey
Chau Minh Pham
Amir Houmansadr
24
1
0
11 Jul 2024
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song
Yuheng Huang
Zhehua Zhou
Lei Ma
37
6
0
10 Jul 2024
Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
K. Kenthapadi
M. Sameki
Ankur Taly
HILM
ELM
AILaw
34
12
0
10 Jul 2024
LIONs: An Empirically Optimized Approach to Align Language Models
Xiao Yu
Qingyang Wu
Yu Li
Zhou Yu
ALM
32
3
0
09 Jul 2024
R^2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang
Bo-wen Li
LRM
29
12
0
08 Jul 2024
Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course
Cheng-Han Chiang
Wei-Chih Chen
Chun-Yi Kuan
Chienchou Yang
Hung-yi Lee
ELM
AI4Ed
35
5
0
07 Jul 2024
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi
Yule Liu
Zhen Sun
Tianshuo Cong
Xinlei He
Jiaxing Song
Ke Xu
Qi Li
AAML
34
79
0
05 Jul 2024
Single Character Perturbations Break LLM Alignment
Leon Lin
Hannah Brown
Kenji Kawaguchi
Michael Shieh
AAML
61
2
0
03 Jul 2024
SOS! Soft Prompt Attack Against Open-Source Large Language Models
Ziqing Yang
Michael Backes
Yang Zhang
Ahmed Salem
AAML
30
6
0
03 Jul 2024
JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets
Zhihua Jin
Shiyi Liu
Haotian Li
Xun Zhao
Huamin Qu
25
3
0
03 Jul 2024
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
Hayder Elesedy
Pedro M. Esperança
Silviu Vlad Oprea
Mete Ozay
KELM
18
2
0
03 Jul 2024
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
Zhexin Zhang
Junxiao Yang
Pei Ke
Shiyao Cui
Chujie Zheng
Hongning Wang
Minlie Huang
AAML
MU
31
26
0
03 Jul 2024
Purple-teaming LLMs with Adversarial Defender Training
Jingyan Zhou
Kun Li
Junan Li
Jiawen Kang
Minda Hu
Xixin Wu
Helen Meng
AAML
27
1
0
01 Jul 2024