ClinKD: Cross-Modal Clinical Knowledge Distiller For Multi-Task Medical Images

Medical Visual Question Answering (Med-VQA) is a critical and challenging subtask of the general VQA domain. Despite significant advances in general Visual Question Answering (VQA), multimodal large language models (MLLMs) still exhibit substantial limitations in multi-task VQA scenarios. These limitations manifest as erroneous spatial localization and misinterpretation of medical images, and stem from two fundamental issues: inadequate image-text alignment and insufficient medical knowledge in general-purpose MLLMs for specialized medical applications. To address these issues, we introduce the Cross-Modal Clinical Knowledge Distiller (ClinKD), a framework designed to enhance image-text alignment and to provide a more effective mechanism for adapting MLLMs to medical knowledge. Extensive experiments demonstrate that ClinKD achieves state-of-the-art performance on Med-GRIT-270k, a challenging medical benchmark containing fine-grained multi-task QA pairs. The results indicate that our approach both significantly improves image-text alignment and effectively adapts MLLMs to medical knowledge. The source code for ClinKD is available at: this https URL.
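The abstract does not specify ClinKD's distillation objective. For orientation only, below is a minimal sketch of the standard soft-target knowledge distillation loss (Hinton et al., 2015), which cross-modal distillers commonly build on; the names `student_logits`, `teacher_logits`, and `lambda_kd` are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-target knowledge distillation loss.

    KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 so gradient magnitudes stay comparable
    across temperatures. This is a generic KD objective, not necessarily
    the one used in ClinKD.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2

# Typical usage: combine with the supervised task loss (e.g., cross-entropy
# on QA answer tokens); lambda_kd is a hypothetical weighting hyperparameter.
# total_loss = task_loss + lambda_kd * distillation_loss(s_logits, t_logits)
```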
@article{ge2025_2502.05928,
  title   = {ClinKD: Cross-Modal Clinical Knowledge Distiller For Multi-Task Medical Images},
  author  = {Hongyu Ge and Longkun Hao and Zihui Xu and Zhenxin Lin and Bin Li and Shoujun Zhou and Hongjin Zhao and Yihang Liu},
  journal = {arXiv preprint arXiv:2502.05928},
  year    = {2025}
}