
Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization

Abstract

Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. However, deploying such large models remains challenging, particularly in resource-constrained environments. Knowledge distillation (KD) offers a well-established solution to this problem; however, recent KD approaches from VLMs often involve multi-stage training or additional tuning, increasing computational overhead and optimization complexity. In this paper, we propose Dual-Head Optimization (DHO) -- a simple yet effective KD framework that transfers knowledge from VLMs to compact, task-specific models in semi-supervised settings. Specifically, we introduce dual prediction heads that independently learn from labeled data and teacher predictions, and propose to linearly combine their outputs during inference. We observe that DHO mitigates gradient conflicts between supervised and distillation signals, enabling more effective feature learning than single-head KD baselines. As a result, extensive experiments show that DHO consistently outperforms baselines across multiple domains and fine-grained datasets. Notably, on ImageNet, it achieves state-of-the-art performance, improving accuracy by 3% and 0.1% with 1% and 10% labeled data, respectively, while using fewer parameters.
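The sketch below is a minimal, illustrative rendering of the dual-head idea described in the abstract: a shared student backbone feeds two classification heads, one trained with a supervised loss on labeled data and one distilled from the VLM teacher's soft predictions, with the two heads' outputs linearly combined at inference. It is not the authors' implementation; names such as `backbone`, `alpha`, and `temperature` are assumptions for the example.

```python
# Minimal sketch of a dual-head student (assumed PyTorch-style code, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualHeadStudent(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                            # shared feature extractor
        self.head_ce = nn.Linear(feat_dim, num_classes)     # head trained on labeled data
        self.head_kd = nn.Linear(feat_dim, num_classes)     # head distilled from the VLM teacher

    def forward(self, x):
        z = self.backbone(x)
        return self.head_ce(z), self.head_kd(z)


def training_step(model, x_labeled, y, x_unlabeled, teacher_probs, temperature=2.0):
    """One training step: each head receives its own signal, so supervised and
    distillation gradients are not forced through a single classifier."""
    logits_ce, _ = model(x_labeled)
    loss_sup = F.cross_entropy(logits_ce, y)

    _, logits_kd = model(x_unlabeled)
    loss_kd = F.kl_div(
        F.log_softmax(logits_kd / temperature, dim=-1),
        teacher_probs,                                      # soft targets from the VLM teacher
        reduction="batchmean",
    ) * temperature ** 2
    return loss_sup + loss_kd


@torch.no_grad()
def predict(model, x, alpha=0.5):
    """Inference: linearly combine the two heads' softmax outputs (alpha is an assumed mixing weight)."""
    logits_ce, logits_kd = model(x)
    return alpha * F.softmax(logits_ce, dim=-1) + (1 - alpha) * F.softmax(logits_kd, dim=-1)
```

In this reading, the linear combination at inference is a single hyperparameter choice rather than an additional tuning stage, which is consistent with the abstract's emphasis on simplicity.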

@article{kang2025_2505.07675,
  title={ Simple Semi-supervised Knowledge Distillation from Vision-Language Models via $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization },
  author={ Seongjae Kang and Dong Bok Lee and Hyungjoon Jang and Sung Ju Hwang },
  journal={arXiv preprint arXiv:2505.07675},
  year={ 2025 }
}