Although deep learning models owe their remarkable success to deep and complex architectures, this very complexity typically comes at the expense of real-time performance. To address this issue, a variety of model compression techniques have been proposed, among which knowledge distillation (KD) stands out for its strong empirical performance. KD involves two concurrent processes: (i) matching the outputs of a large, pre-trained teacher network and a lightweight student network, and (ii) training the student to solve its designated downstream task. The associated loss functions are termed the distillation loss and the downstream-task loss, respectively. Numerous prior studies report that KD is most effective when the influence of the distillation loss outweighs that of the downstream-task loss. The influence (or importance) of each term is typically regulated by a balancing parameter. This paper provides a mathematical rationale showing that, in a simple KD setting, the balancing parameter should be dynamically adjusted as the loss decreases.
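For concreteness, below is a minimal sketch of the two-term KD objective described above, written in PyTorch. The function name `kd_loss`, the temperature value, and the convex combination via `alpha` are illustrative assumptions; the abstract does not specify the authors' exact formulation or their rule for adjusting the balancing parameter dynamically.

```python
# Hedged sketch of a standard KD objective: a weighted combination of a
# distillation term (teacher-student output matching) and a downstream-task
# term (cross-entropy on ground-truth labels). The balancing parameter
# `alpha` and temperature are illustrative, not the paper's exact setting.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, alpha, temperature=4.0):
    """Combine the distillation loss and the downstream-task loss.

    An `alpha` close to 1 lets the distillation term dominate, the regime
    that prior studies (per the abstract) report as most effective.
    """
    # Distillation term: KL divergence between softened teacher and
    # student output distributions, scaled by T^2 as in Hinton-style KD.
    distill = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Downstream-task term: standard cross-entropy on the ground truth.
    task = F.cross_entropy(student_logits, targets)

    # Convex combination controlled by the balancing parameter.
    return alpha * distill + (1.0 - alpha) * task
```

In this sketch `alpha` is passed in per training step, so a dynamic schedule of the kind the paper analyzes could update it as the observed loss decreases; the specific update rule is the subject of the paper and is not reproduced here.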
@article{kim2025_2505.06270,
  title   = {Importance Analysis for Dynamic Control of Balancing Parameter in a Simple Knowledge Distillation Setting},
  author  = {Seongmin Kim and Kwanho Kim and Minseung Kim and Kanghyun Jo},
  journal = {arXiv preprint arXiv:2505.06270},
  year    = {2025}
}