Jailbroken: How Does LLM Safety Training Fail? (arXiv:2307.02483)
5 July 2023
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
Papers citing "Jailbroken: How Does LLM Safety Training Fail?"
50 / 634 papers shown
Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement
Zisu Huang
Xiaohua Wang
Feiran Zhang
Zhibo Xu
Cenyuan Zhang
Xiaoqing Zheng
Xuanjing Huang
AAML
LRM
32
4
0
01 Jul 2024
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi
Alexander Wei
Eric Wallace
Tony T. Wang
Nika Haghtalab
Jacob Steinhardt
SILM
AAML
37
28
0
28 Jun 2024
Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection
Yuqi Zhou
Lin Lu
Hanchi Sun
Pan Zhou
Lichao Sun
31
9
0
28 Jun 2024
Rethinking harmless refusals when fine-tuning foundation models
Florin Pop
Judd Rosenblatt
Diogo Schwerz de Lucena
Michael Vaiana
16
0
0
27 Jun 2024
Revealing Fine-Grained Values and Opinions in Large Language Models
Dustin Wright
Arnav Arora
Nadav Borenstein
Srishti Yadav
Serge J. Belongie
Isabelle Augenstein
33
1
0
27 Jun 2024
FernUni LLM Experimental Infrastructure (FLEXI) -- Enabling Experimentation and Innovation in Higher Education Through Access to Open Large Language Models
Torsten Zesch
Michael Hanses
Niels Seidel
Piush Aggarwal
Dirk Veiel
Claudia de Witt
21
0
0
27 Jun 2024
A Survey on Privacy Attacks Against Digital Twin Systems in AI-Robotics
Ivan A. Fernandez
Subash Neupane
Trisha Chakraborty
Shaswata Mitra
Sudip Mittal
Nisha Pillai
Jingdao Chen
Shahram Rahimi
52
1
0
27 Jun 2024
Jailbreaking LLMs with Arabic Transliteration and Arabizi
Mansour Al Ghanim
Saleh Almohaimeed
Mengxin Zheng
Yan Solihin
Qian Lou
34
2
0
26 Jun 2024
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Aakanksha
Arash Ahmadian
B. Ermiş
Seraphina Goldfarb-Tarrant
Julia Kreutzer
Marzieh Fadaee
Sara Hooker
40
28
0
26 Jun 2024
Poisoned LangChain: Jailbreak LLMs by LangChain
Ziqiu Wang
Jun Liu
Shengkai Zhang
Yang Yang
28
7
0
26 Jun 2024
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
Caishuang Huang
Wanxu Zhao
Rui Zheng
Huijie Lv
Shihan Dou
...
Junjie Ye
Yuming Yang
Tao Gui
Qi Zhang
Xuanjing Huang
LLMSV
AAML
45
7
0
26 Jun 2024
"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models
Zhen Tan
Chengshuai Zhao
Raha Moraffah
Yifan Li
Song Wang
Jundong Li
Tianlong Chen
Huan Liu
SILM
46
16
0
26 Jun 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin
Leyang Hu
Xinuo Li
Peiyan Zhang
Chonghan Chen
Jun Zhuang
Haohan Wang
PILM
36
26
0
26 Jun 2024
Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization
Zhengyue Zhao
Xiaoyun Zhang
Kaidi Xu
Xing Hu
Rui Zhang
Zidong Du
Qi Guo
Yunji Chen
22
5
0
24 Jun 2024
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
Siyuan Wang
Zhuohan Long
Zhihao Fan
Zhongyu Wei
37
6
0
21 Jun 2024
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland
Alexander Lyzhov
Jacob Pfau
Salsabila Mahdi
Samuel R. Bowman
LLMSV
AAML
57
17
0
21 Jun 2024
Pareto-Optimal Learning from Preferences with Hidden Context
Ryan Boldi
Li Ding
Lee Spector
S. Niekum
54
6
0
21 Jun 2024
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
Anton Xue
Avishree Khare
Rajeev Alur
Surbhi Goel
Eric Wong
46
2
0
21 Jun 2024
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
Hasan Hammoud
Umberto Michieli
Fabio Pizzati
Philip H. S. Torr
Adel Bibi
Bernard Ghanem
Mete Ozay
MoMe
31
14
0
20 Jun 2024
Unmasking Database Vulnerabilities: Zero-Knowledge Schema Inference Attacks in Text-to-SQL Systems
Đorđe Klisura
Anthony Rios
AAML
19
1
0
20 Jun 2024
Adversaries Can Misuse Combinations of Safe Models
Erik Jones
Anca Dragan
Jacob Steinhardt
40
6
0
20 Jun 2024
Prompt Injection Attacks in Defended Systems
Daniil Khomsky
Narek Maloyan
Bulat Nutfullin
AAML
SILM
19
3
0
20 Jun 2024
BeHonest: Benchmarking Honesty in Large Language Models
Steffi Chern
Zhulin Hu
Yuqing Yang
Ethan Chern
Yuan Guo
Jiahe Jin
Binjie Wang
Pengfei Liu
HILM
ALM
84
3
0
19 Jun 2024
SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation
Xiaoze Liu
Ting Sun
Tianyang Xu
Feijie Wu
Cunxiang Wang
Xiaoqian Wang
Jing Gao
AAML
DeLMO
AILaw
39
15
0
18 Jun 2024
[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs
Abhinav Rao
Monojit Choudhury
Somak Aditya
24
0
0
18 Jun 2024
Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun
Ann Yuan
Marius Guerard
Emily Reif
Michael A. Lepori
Lucas Dixon
LLMSV
41
7
0
17 Jun 2024
Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs
S. Kadhe
Farhan Ahmed
Dennis Wei
Nathalie Baracaldo
Inkit Padhi
MoMe
MU
26
6
0
17 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
45
132
0
17 Jun 2024
Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
Shangqing Tu
Zhuoran Pan
Wenxuan Wang
Zhexin Zhang
Yuliang Sun
Jifan Yu
Hongning Wang
Lei Hou
Juanzi Li
ALM
42
1
0
17 Jun 2024
Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
Yihuai Hong
Lei Yu
Shauli Ravfogel
Haiqin Yang
Mor Geva
KELM
MU
58
17
0
17 Jun 2024
MoE-RBench: Towards Building Reliable Language Models with Sparse Mixture-of-Experts
Guanjie Chen
Xinyu Zhao
Tianlong Chen
Yu Cheng
MoE
62
5
0
17 Jun 2024
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
Lingrui Mei
Shenghua Liu
Yiwei Wang
Baolong Bi
Jiayi Mao
Xueqi Cheng
AAML
45
9
0
17 Jun 2024
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models
Yuqing Wang
Yun Zhao
LRM
AAML
ELM
27
1
0
16 Jun 2024
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
Yuping Lin
Pengfei He
Han Xu
Yue Xing
Makoto Yamada
Hui Liu
Jiliang Tang
27
10
0
16 Jun 2024
ElicitationGPT: Text Elicitation Mechanisms via Language Models
Yifan Wu
Jason D. Hartline
36
3
0
13 Jun 2024
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Zhao Xu
Fan Liu
Hao Liu
AAML
40
7
0
13 Jun 2024
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
Sarah Ball
Frauke Kreuter
Nina Rimsky
24
13
0
13 Jun 2024
RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs
Xuan Chen
Yuzhou Nie
Lu Yan
Yunshu Mao
Wenbo Guo
Xiangyu Zhang
21
7
0
13 Jun 2024
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran
Jinyuan Liu
Yichen Gong
Jingyi Zheng
Xinlei He
Tianshuo Cong
Anyu Wang
ELM
42
10
0
13 Jun 2024
We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs
Joseph Spracklen
Raveen Wijewickrama
A. H. M. N. Sakib
Anindya Maiti
Murtuza Jadliwala
40
10
0
12 Jun 2024
Merging Improves Self-Critique Against Jailbreak Attacks
Victor Gallego
AAML
MoMe
36
3
0
11 Jun 2024
Survey for Landing Generative AI in Social and E-commerce Recsys -- the Industry Perspectives
Da Xu
Danqing Zhang
Guangyu Yang
Bo Yang
Shuyuan Xu
Lingling Zheng
Cindy Liang
27
2
0
10 Jun 2024
Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks
Zonghao Ying
Aishan Liu
Xianglong Liu
Dacheng Tao
54
16
0
10 Jun 2024
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Yongbin Li
24
26
0
09 Jun 2024
Machine Against the RAG: Jamming Retrieval-Augmented Generation with Blocker Documents
Avital Shafran
R. Schuster
Vitaly Shmatikov
37
27
0
09 Jun 2024
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang
Daoyuan Wu
Zhenlan Ji
Zongjie Li
Pingchuan Ma
Shuai Wang
Yingjiu Li
Yang Liu
Ning Liu
Juergen Rahmel
AAML
71
8
0
08 Jun 2024
Improving Alignment and Robustness with Circuit Breakers
Andy Zou
Long Phan
Justin Wang
Derek Duenas
Maxwell Lin
Maksym Andriushchenko
Rowan Wang
Zico Kolter
Matt Fredrikson
Dan Hendrycks
AAML
36
71
0
06 Jun 2024
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt
Zonghao Ying
Aishan Liu
Tianyuan Zhang
Zhengmin Yu
Siyuan Liang
Xianglong Liu
Dacheng Tao
AAML
33
26
0
06 Jun 2024
Assessing LLMs for Zero-shot Abstractive Summarization Through the Lens of Relevance Paraphrasing
Hadi Askari
Anshuman Chhabra
Muhao Chen
Prasant Mohapatra
33
4
0
06 Jun 2024
Ranking Manipulation for Conversational Search Engines
Samuel Pfrommer
Yatong Bai
Tanmay Gautam
Somayeh Sojoudi
SILM
39
4
0
05 Jun 2024