Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models. International Conference on Learning Representations (ICLR), 2025.
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates. Neural Information Processing Systems (NeurIPS), 2024.
SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation. International Conference on Learning Representations (ICLR), 2025.
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks. North American Chapter of the Association for Computational Linguistics (NAACL), 2025.
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation. IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024.
Tamper-Resistant Safeguards for Open-Weight LLMs. International Conference on Learning Representations (ICLR), 2025.