Advancing Medical Representation Learning Through High-Quality Data

Abstract

Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.

@article{baghbanzadeh2025_2503.14377,
  title={Advancing Medical Representation Learning Through High-Quality Data},
  author={Negin Baghbanzadeh and Adibvafa Fallahpour and Yasaman Parhizkar and Franklin Ogidi and Shuvendu Roy and Sajad Ashkezari and Vahid Reza Khazaie and Michael Colacci and Ali Etemad and Arash Afkanpour and Elham Dolatabadi},
  journal={arXiv preprint arXiv:2503.14377},
  year={2025}
}