Toward Secure Tuning: Mitigating Security Risks from Instruction Fine-Tuning

6 October 2024
Yanrui Du
Sendong Zhao
Jiawei Cao
Ming Ma
Danyang Zhao
Shuren Qi
Fenglei Fan
Ting Liu
Bing Qin
Abstract

Instruction fine-tuning has emerged as a critical technique for customizing Large Language Models (LLMs) to specific applications. However, recent studies have highlighted significant security vulnerabilities in fine-tuned LLMs. Existing defense efforts focus primarily on pre-training and post-training methods, while in-training methods remain underexplored. To fill this gap, we introduce a novel secure-tuning strategy called SWAT. By analyzing how module-level parameters (e.g., Q/K/V/O) affect drift in the security feature space, we identify a robust subset of modules, termed Mods_Rob. Our SWAT strategy begins by warming up Mods_Rob to capture low-level features with minimal security risks, followed by training all parameters to achieve optimal task performance. Essentially, this strategy shifts the early learning burden from global parameters to Mods_Rob, reducing the update magnitudes of the non-robust subset. Across various datasets, scenarios, and LLMs, our strategy demonstrates significant success in mitigating security risks while preserving task performance. Importantly, it can be seamlessly integrated with pre-training and post-training methods, leading to further improvements.
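To make the two-stage procedure concrete, the following is a minimal PyTorch-style sketch of a SWAT-like fine-tune: gradients first flow only to an assumed robust module subset (here, LLaMA-style q_proj/k_proj/v_proj/o_proj projections), then all parameters are unfrozen. The module names, step counts, and learning rate are illustrative assumptions, not the paper's reported settings, and the paper's actual module selection is derived from its security-feature-space analysis.

import torch

def swat_finetune(model, dataloader,
                  robust_names=("q_proj", "k_proj", "v_proj", "o_proj"),
                  warmup_steps=500, total_steps=3000, lr=2e-5):
    """Two-stage sketch of a SWAT-style tune.

    Stage 1: only the assumed robust subset (Mods_Rob) is trainable.
    Stage 2: all parameters are unfrozen for full task training.
    Assumes a HuggingFace-style model whose forward pass returns .loss.
    """
    def set_trainable(predicate):
        for name, param in model.named_parameters():
            param.requires_grad = predicate(name)

    # Stage 1: restrict gradient updates to Mods_Rob.
    set_trainable(lambda n: any(r in n for r in robust_names))
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)

    step = 0
    for batch in dataloader:
        if step == warmup_steps:
            # Stage 2: shift from Mods_Rob-only updates to all parameters,
            # so the early learning burden has already been absorbed by
            # the robust subset and non-robust updates stay smaller.
            set_trainable(lambda n: True)
            optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step >= total_steps:
            break
    return model

Rebuilding the optimizer at the stage switch discards the warm-up optimizer state; a fuller implementation might instead add the newly unfrozen parameters as a second parameter group.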

@article{du2025_2410.04524,
  title={Toward Secure Tuning: Mitigating Security Risks from Instruction Fine-Tuning},
  author={Yanrui Du and Sendong Zhao and Jiawei Cao and Ming Ma and Danyang Zhao and Shuren Qi and Fenglei Fan and Ting Liu and Bing Qin},
  journal={arXiv preprint arXiv:2410.04524},
  year={2025}
}