Can Distillation Mitigate Backdoor Attacks in Pre-trained Encoders?
Self-Supervised Learning (SSL) has become a prominent paradigm for pre-training encoders that learn general-purpose representations from unlabeled data; these encoders are then released on third-party platforms for a broad range of downstream deep learning tasks. However, SSL is vulnerable to backdoor attacks, in which an adversary trains and distributes poisoned pre-trained encoders to contaminate downstream models. In this paper, we study a distillation-based defense against poisoned encoders in SSL. Traditionally, distillation transfers knowledge from a pre-trained teacher model to a student model, enabling the student to replicate or refine the teacher's learned representations. We repurpose distillation to extract benign knowledge from a poisoned pre-trained encoder while removing its backdoor, producing a clean and reliable pre-trained model. We conduct extensive experiments to evaluate the effectiveness of distillation in mitigating backdoor attacks on pre-trained encoders. Across two state-of-the-art backdoor attacks and four widely adopted image classification datasets, our results demonstrate that distillation reduces the attack success rate from 80.87% to 27.51%, with only a 6.35% drop in model accuracy. Furthermore, by comparing four teacher architectures, three student models, and six loss functions, we find that distillation with fine-tuned teacher networks, warm-up-based student training, and attention-based distillation losses yields the best performance.
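To make the idea of an attention-based distillation loss concrete, the following is a minimal sketch of one common formulation (activation-based attention transfer), written in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify which attention-based losses were evaluated, and the function names `attention_map` and `attention_distillation_loss`, as well as the flat-list feature layout, are hypothetical.

```python
import math


def attention_map(features):
    """Collapse a feature map into a normalized spatial attention map.

    `features` is a list of C channels, each a flat list of H*W activations
    (an assumed layout chosen for this sketch). The attention at each spatial
    position is the sum over channels of squared activations, and the
    resulting vector is scaled to unit L2 norm.
    """
    num_positions = len(features[0])
    amap = [sum(channel[i] ** 2 for channel in features)
            for i in range(num_positions)]
    norm = math.sqrt(sum(a * a for a in amap))
    # Leave an all-zero map untouched to avoid dividing by zero.
    return [a / norm for a in amap] if norm > 0 else amap


def attention_distillation_loss(student_feats, teacher_feats):
    """L2 distance between the student's and teacher's attention maps.

    The two networks may have different channel counts, but must share
    the same number of spatial positions.
    """
    s = attention_map(student_feats)
    t = attention_map(teacher_feats)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))


# Toy example: 2 channels over a 2x2 spatial grid (flattened to 4 positions).
student = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
teacher = [[0.0, 3.0, 0.0, 1.0], [1.0, 0.0, 0.0, 2.0]]
loss = attention_distillation_loss(student, teacher)
```

Because the maps are normalized, this loss compares where the two networks concentrate their activations rather than how large those activations are. In a defense setting, the student would be trained on clean data to minimize such a term against the teacher's intermediate features; how that term is weighted and combined with other objectives is left open here.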