Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2405.18641
Cited By
Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning
28 May 2024
Tiansheng Huang
Sihao Hu
Fatih Ilhan
Selim Furkan Tekin
Ling Liu
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning"
12 / 12 papers shown
Title
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Kaifeng Lyu
Haoyu Zhao
Xinran Gu
Dingli Yu
Anirudh Goyal
Sanjeev Arora
ALM
70
41
0
20 Jan 2025
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Zhibo Wang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
36
3
0
17 Nov 2024
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
Samuele Poppi
Zheng-Xin Yong
Yifei He
Bobbie Chern
Han Zhao
Aobo Yang
Jianfeng Chi
AAML
31
11
0
23 Oct 2024
Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning
H. Fernando
Han Shen
Parikshit Ram
Yi Zhou
Horst Samulowitz
Nathalie Baracaldo
Tianyi Chen
CLL
44
2
0
20 Oct 2024
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation
Guozhi Liu
Weiwei Lin
Tiansheng Huang
Ruichao Mo
Qi Mu
Li Shen
AAML
42
9
0
13 Oct 2024
Immunization against harmful fine-tuning attacks
Domenic Rosati
Jan Wehner
Kai Williams
Lukasz Bartoszcze
Jan Batzner
Hassan Sajjad
Frank Rudzicz
AAML
28
15
0
26 Feb 2024
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei
Kaixuan Huang
Yangsibo Huang
Tinghao Xie
Xiangyu Qi
Mengzhou Xia
Prateek Mittal
Mengdi Wang
Peter Henderson
AAML
49
78
0
07 Feb 2024
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
Yongshuo Zong
Ondrej Bohdal
Tingyang Yu
Yongxin Yang
Timothy M. Hospedales
VLM
MLLM
49
56
0
03 Feb 2024
Learning and Forgetting Unsafe Examples in Large Language Models
Jiachen Zhao
Zhun Deng
David Madras
James Zou
Mengye Ren
MU
KELM
CLL
61
15
0
20 Dec 2023
FedSpeed: Larger Local Interval, Less Communication Round, and Higher Generalization Accuracy
Yan Sun
Li Shen
Tiansheng Huang
Liang Ding
Dacheng Tao
FedML
27
50
0
21 Feb 2023
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition
Hamed Karimi
J. Nutini
Mark W. Schmidt
114
1,190
0
16 Aug 2016
1