arXiv:2501.17433
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
29 January 2025
Papers citing "Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation" (6 of 6 papers shown)
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets
Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, Yaoqing Yang
05 Jun 2025
CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning
Biao Yi, Tiansheng Huang, Baolei Zhang, Tong Li, Lihai Nie, Zheli Liu, Li Shen
MU, AAML
22 May 2025
Shape it Up! Restoring LLM Safety during Finetuning
ShengYun Peng, Pin-Yu Chen, Jianfeng Chi, Seongmin Lee, Duen Horng Chau
22 May 2025
Safety Subspaces are Not Distinct: A Fine-Tuning Case Study
Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma
20 May 2025
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, Anil Vullikanti
AAML
11 May 2025
A generative approach to LLM harmfulness detection with special red flag tokens
Sophie Xhonneux, David Dobre, Mehrnaz Mofakhami, Leo Schwinn, Gauthier Gidel
22 Feb 2025