Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
arXiv: 2402.14968 · 22 February 2024
Authors: Jiong Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, P. McDaniel, Muhao Chen, Bo Li, Chaowei Xiao
Tags: AAML, SILM
Papers citing "Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment" (16 of 16 papers shown):

- SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging. Aladin Djuhera, S. Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche. Tags: MoMe. 21 Mar 2025.
- Layer-Aware Task Arithmetic: Disentangling Task-Specific and Instruction-Following Knowledge. Yan-Lun Chen, Yi-Ru Wei, Chia-Yi Hsu, Chia-Mu Yu, Chun-ying Huang, Ying-Dar Lin, Yu-Sung Wu, Wei-Bin Lee. Tags: MoMe, KELM. 27 Feb 2025.
- Foot-In-The-Door: A Multi-turn Jailbreak for LLMs. Zixuan Weng, Xiaolong Jin, Jinyuan Jia, X. Zhang. Tags: AAML. 27 Feb 2025.
- BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack. Zihao Zhu, Hongbao Zhang, Mingda Zhang, Ruotong Wang, Guanzong Wu, Ke Xu, Baoyuan Wu. Tags: AAML, LRM. 16 Feb 2025.
- Process Reward Model with Q-Value Rankings. W. Li, Yixuan Li. Tags: LRM. 15 Oct 2024.
- AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment. Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang. Tags: SILM, AAML. 15 Oct 2024.
- JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models. Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang. Tags: ELM. 13 Jun 2024.
- Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs. Fan Liu, Zhao Xu, Hao Liu. Tags: AAML. 07 Jun 2024.
- Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models. Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, Chun-ying Huang. 27 May 2024.
- No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks. Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, Wenjie Li. Tags: AAML. 25 May 2024.
- ARGS: Alignment as Reward-Guided Search. Maxim Khanov, Jirayu Burapacheep, Yixuan Li. 23 Jan 2024.
- Hijacking Large Language Models via Adversarial In-Context Learning. Yao Qiang, Xiangyu Zhou, Dongxiao Zhu. 16 Nov 2023.
- LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B. Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish. Tags: ALM. 31 Oct 2023.
- Privacy in Large Language Models: Attacks, Defenses and Future Directions. Haoran Li, Yulin Chen, Jinglong Luo, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Yangqiu Song. Tags: PILM. 16 Oct 2023.
- Sparks of Artificial General Intelligence: Early experiments with GPT-4. Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, J. Gehrke, Eric Horvitz, ..., Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang. Tags: ELM, AI4MH, AI4CE, ALM. 22 Mar 2023.
- Training language models to follow instructions with human feedback. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe. Tags: OSLM, ALM. 04 Mar 2022.