Fight Back Against Jailbreaking via Prompt Adversarial Tuning
arXiv:2402.06255, 9 February 2024
Yichuan Mo, Yuji Wang, Zeming Wei, Yisen Wang
Categories: AAML, SILM
Papers citing "Fight Back Against Jailbreaking via Prompt Adversarial Tuning" (5 / 5 papers shown)
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs. Zixuan Weng, Xiaolong Jin, Jinyuan Jia, X. Zhang. AAML. 27 Feb 2025.
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner. Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, Juergen Rahmel. AAML. 08 Jun 2024.
Towards Building a Robust Toxicity Predictor. Dmitriy Bespalov, Sourav S. Bhabesh, Yi Xiang, Liutong Zhou, Yanjun Qi. AAML. 09 Apr 2024.
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks. Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, Nael B. Abu-Ghazaleh. AAML. 16 Oct 2023.
Training language models to follow instructions with human feedback. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe. OSLM, ALM. 04 Mar 2022.