Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning

28 May 2024
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
arXiv (abs) | PDF | HTML | GitHub (21★)

Papers citing "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning"

23 papers shown

A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space
Bingjie Zhang, Yibo Yang, Renzhe, Dandan Guo, Jindong Gu, Philip Torr, Bernard Ghanem
16 Oct 2025

Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment
Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son
MoE
26 Sep 2025

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Juil Sock, David M. Krueger, Fazl Barez, Adel Bibi
17 Aug 2025

Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models
International Conference on Learning Representations (ICLR), 2025
Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li
AAML
19 Jun 2025

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin
Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song
10 Jun 2025

Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning
Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong
AAML
04 Jun 2025

SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA
Minrui Luo, Fuhang Kuang, Yu Wang, Zirui Liu, Tianxing He
CLL
29 May 2025

Unveiling the Basin-Like Loss Landscape in Large Language Models
Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, Jun Zhu
MoMe, ELM
23 May 2025

Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study
Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma
20 May 2025

Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets
Ning Lu, Shengcai Liu, Jiahao Wu, Weiyu Chen, Zhirui Zhang, Yew-Soon Ong, Qi Wang, Ke Tang
17 May 2025

Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, Anil Vullikanti
AAML
11 May 2025

Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Yun Wang, Tiansheng Huang, Li Shen, Huanjin Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao
AAML, MoMe, CLL
30 Jan 2025

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Neural Information Processing Systems (NeurIPS), 2024
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
ALM
20 Jan 2025

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation
International Conference on Learning Representations (ICLR), 2025
Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang
03 Jan 2025

H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs
Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Yichang Xu, Zachary Yahn, Ling Liu
MoMe
26 Nov 2024

JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He, Peng Kuang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen
17 Nov 2024

Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Samuele Poppi, Zheng-Xin Yong, Yifei He, Bobbie Chern, Han Zhao, Aobo Yang, Jianfeng Chi
AAML
23 Oct 2024

Understanding Forgetting in LLM Supervised Fine-Tuning and Preference Learning - A Convex Optimization Perspective
H. Fernando, Han Shen, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen
CLL
20 Oct 2024

Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation
IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024
Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Li Shen
AAML
13 Oct 2024

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
PILM, AAML
05 Sep 2024

Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu
AAML, MoMe
18 Aug 2024

Tamper-Resistant Safeguards for Open-Weight LLMs
International Conference on Learning Representations (ICLR), 2024
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, ..., Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika
AAML, MU
01 Aug 2024

A Survey on Large Language Model-Based Game Agents
Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Yichang Xu, Zachary Yahn, Ling Liu
AI4CE, LLMAG, LM&Ro, LM&MA
02 Apr 2024