Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models. International Conference on Learning Representations (ICLR), 2025.
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates. Neural Information Processing Systems (NeurIPS), 2024.
SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation. International Conference on Learning Representations (ICLR), 2025.
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks. North American Chapter of the Association for Computational Linguistics (NAACL), 2025.
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation. IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024.
Tamper-Resistant Safeguards for Open-Weight LLMs. International Conference on Learning Representations (ICLR), 2025.