Quantifying Knowledge Distillation Using Partial Information Decomposition

Knowledge distillation enables deploying complex machine learning models in resource-constrained environments by training a smaller student model to emulate the internal representations of a larger, more complex teacher model. However, the teacher's representations can also encode nuisance or additional information that is not relevant to the downstream task. Distilling such irrelevant information can actually impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition to quantify and explain both the knowledge that has been transferred and the knowledge that remains to be distilled for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the redundant information about the task shared between the teacher and student. We propose a novel multi-level optimization that incorporates this redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID yields more resilient and effective distillation under nuisance teachers because it quantifies and targets task-relevant knowledge rather than simply aligning student and teacher representations.
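
As background, Partial Information Decomposition (in the standard Williams-Beer sense) splits the joint information that the teacher representation T and the student representation S carry about the task variable Y into redundant, unique, and synergistic components. The decomposition below is a generic statement of that idea, included only to make the abstract's terms concrete; the notation is assumed here, and the paper's exact definitions and redundancy measure may differ.

```latex
% Generic PID of the joint task information carried by teacher T and student S.
% Red: task information present in both T and S (the "transferred" knowledge).
% Unq(Y : T \setminus S): task information only the teacher has (knowledge left to distill).
I\big((T, S); Y\big)
  = \mathrm{Red}(Y : T, S)
  + \mathrm{Unq}(Y : T \setminus S)
  + \mathrm{Unq}(Y : S \setminus T)
  + \mathrm{Syn}(Y : T, S)
```

The sketch below is a minimal, hypothetical illustration of where a redundancy-style regularizer could sit alongside a standard distillation objective. The function name `redundancy_estimate`, the weighting coefficients, and the overall loss composition are assumptions made for illustration; they are not the paper's RID formulation, which relies on a multi-level optimization that is not reproduced here.

```python
# Minimal sketch (assumptions labeled below), not the paper's RID implementation.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      redundancy_estimate=None,
                      temperature=2.0, alpha=0.5, beta=0.1):
    """Cross-entropy + soft-label KL distillation, with an optional slot for a
    task-relevant redundancy term (hypothetical placeholder)."""
    # Supervised task loss on ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # Standard Hinton-style soft-label distillation at temperature T.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = (1 - alpha) * ce + alpha * kd

    # Hypothetical hook: reward redundant (shared, task-relevant) information
    # between teacher and student instead of rewarding raw alignment alone.
    if redundancy_estimate is not None:
        loss = loss - beta * redundancy_estimate(student_logits, teacher_logits, labels)
    return loss
```

In this sketch, passing `redundancy_estimate=None` recovers ordinary soft-label distillation, which makes the contribution of the extra redundancy term easy to isolate.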
@article{dissanayake2025_2411.07483,
  title   = {Quantifying Knowledge Distillation Using Partial Information Decomposition},
  author  = {Pasan Dissanayake and Faisal Hamman and Barproda Halder and Ilia Sucholutsky and Qiuyi Zhang and Sanghamitra Dutta},
  journal = {arXiv preprint arXiv:2411.07483},
  year    = {2025}
}