Instruction fine-tuning has emerged as a critical technique for customizing Large Language Models (LLMs) to specific applications. However, recent studies have highlighted significant security vulnerabilities in fine-tuned LLMs. Existing defense efforts focus primarily on pre-training and post-training methods, while in-training methods remain underexplored. To fill this gap, we introduce a novel secure-tuning strategy called SWAT. By analyzing how module-level parameters (e.g., Q/K/V/O) affect drift in the security feature space, we identify a robust subset of modules, termed Mods_Rob. Our SWAT strategy begins by warming up Mods_Rob to capture low-level features with minimal security risks, followed by training all parameters to achieve optimal task performance. Essentially, this strategy shifts the early learning burden from the global parameters to Mods_Rob, reducing the update magnitudes of the non-robust subset. Across various datasets, scenarios, and LLMs, our strategy has demonstrated significant success in mitigating security risks while preserving task performance. Importantly, it can be seamlessly integrated with pre-training and post-training methods, yielding further improvements.
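To make the two-stage schedule concrete, below is a minimal sketch of how a warm-up-then-full-tuning procedure could be wired up in PyTorch, assuming a HuggingFace-style causal LM. The module names matched by ROBUST_MODULE_KEYS, the checkpoint name, and the placement of the stage boundary are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the two-stage SWAT-style schedule described in the abstract,
# under the assumptions stated above.
from transformers import AutoModelForCausalLM

ROBUST_MODULE_KEYS = ("q_proj", "v_proj")  # hypothetical stand-in for Mods_Rob


def set_trainable(model, keys=None):
    """Unfreeze only parameters whose names contain one of `keys`;
    if keys is None, unfreeze all parameters."""
    for name, param in model.named_parameters():
        param.requires_grad = keys is None or any(k in name for k in keys)


model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Stage 1: warm up only the robust subset so it absorbs low-level task features
# while the rest of the network stays fixed.
set_trainable(model, ROBUST_MODULE_KEYS)
# ... run a short warm-up phase with the usual instruction fine-tuning loop ...

# Stage 2: unfreeze everything and continue fine-tuning; the non-robust modules
# now require smaller updates to reach good task performance.
set_trainable(model, None)
# ... continue training on the full instruction dataset ...
```

The same freeze/unfreeze switch can be dropped into any standard fine-tuning loop; only the point at which stage 2 begins needs to be tuned.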
@article{du2025_2410.04524,
  title   = {Toward Secure Tuning: Mitigating Security Risks from Instruction Fine-Tuning},
  author  = {Yanrui Du and Sendong Zhao and Jiawei Cao and Ming Ma and Danyang Zhao and Shuren Qi and Fenglei Fan and Ting Liu and Bing Qin},
  journal = {arXiv preprint arXiv:2410.04524},
  year    = {2025}
}