arXiv: 2406.20053
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt. SILM, AAML. 28 June 2024.
Papers citing "Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation" (25 papers shown)
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng R. Li, Anil Vullikanti. AAML. 11 May 2025.
Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents
Christian Schroeder de Witt. AAML, AI4CE. 04 May 2025.
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving. ELM. 07 Apr 2025.
AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations
Dillon Bowen, Ann-Kathrin Dombrowski, Adam Gleave, Chris Cundy. ELM. 17 Mar 2025.
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, ..., Zikui Cai, Bilal Chughtai, Y. Gal, Furong Huang, Dylan Hadfield-Menell. MU, AAML, ELM. 03 Feb 2025.
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Y. Wang, Tiansheng Huang, Li Shen, H. Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao. AAML, MoMe, CLL. 30 Jan 2025.
Towards Data Governance of Frontier AI Models
Jason Hausenloy, Duncan McClements, Madhavendra Thakur. 05 Dec 2024.
DEAN: Deactivating the Coupled Neurons to Mitigate Fairness-Privacy Conflicts in Large Language Models
Chen Qian, Dongrui Liu, Jie Zhang, Yong Liu, Jing Shao. 22 Oct 2024.
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation
Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Li Shen. AAML. 13 Oct 2024.
PII-Scope: A Benchmark for Training Data PII Leakage Assessment in LLMs
K. K. Nakka, Ahmed Frikha, Ricardo Mendes, Xue Jiang, Xuebing Zhou. 09 Oct 2024.
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan R. Cope, Nandi Schoots. 02 Oct 2024.
FlipAttack: Jailbreak LLMs via Flipping
Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, Bryan Hooi. AAML. 02 Oct 2024.
Endless Jailbreaks with Bijection Learning
Brian R. Y. Huang, Maximilian Li, Leonard Tang. AAML. 02 Oct 2024.
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu. AAML. 26 Sep 2024.
Software Security Analysis in 2030 and Beyond: A Research Roadmap
Marcel Böhme, Eric Bodden, Tevfik Bultan, Cristian Cadar, Yang Liu, Giuseppe Scanniello. 26 Sep 2024.
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang. PILM, AAML. 05 Sep 2024.
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue. AAML, MU. 27 Aug 2024.
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu. AAML, MoMe. 18 Aug 2024.
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu. 12 Jul 2024.
Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu. 28 May 2024.
Vaccine: Perturbation-aware Alignment for Large Language Model
Tiansheng Huang, Sihao Hu, Ling Liu. 02 Feb 2024.
Learning and Forgetting Unsafe Examples in Large Language Models
Jiachen Zhao, Zhun Deng, David Madras, James Zou, Mengye Ren. MU, KELM, CLL. 20 Dec 2023.
Poisoning Language Models During Instruction Tuning
Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein. SILM. 01 May 2023.
Instruction Tuning with GPT-4
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, Jianfeng Gao. SyDa, ALM, LM&MA. 06 Apr 2023.
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe. OSLM, ALM. 04 Mar 2022.