Multi-Modal Foundation Models for Computational Pathology: A Survey

Foundation models have emerged as a powerful paradigm in computational pathology (CPath), enabling scalable and generalizable analysis of histopathological images. While early developments centered on uni-modal models trained solely on visual data, recent advances have highlighted the promise of multi-modal foundation models that integrate heterogeneous data sources such as textual reports, structured domain knowledge, and molecular profiles. In this survey, we provide a comprehensive and up-to-date review of multi-modal foundation models in CPath, with a particular focus on models built upon hematoxylin and eosin (H&E) stained whole slide images (WSIs) and tile-level representations. We categorize 32 state-of-the-art multi-modal foundation models into three major paradigms: vision-language, vision-knowledge graph, and vision-gene expression. We further divide vision-language models into non-LLM-based and LLM-based approaches. Additionally, we analyze 28 available multi-modal datasets tailored for pathology, grouped into image-text pairs, instruction datasets, and image-other modality pairs. Our survey also presents a taxonomy of downstream tasks, highlights training and evaluation strategies, and identifies key challenges and future directions. We aim for this survey to serve as a valuable resource for researchers and practitioners working at the intersection of pathology and AI.
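For concreteness, the vision-language paradigm mentioned above is commonly instantiated with a CLIP-style contrastive objective that aligns tile-level image embeddings with report or caption embeddings. The PyTorch sketch below is a minimal, generic illustration of that objective, not the training recipe of any surveyed model; the batch size, embedding dimension, and temperature are illustrative placeholders.

    import torch
    import torch.nn.functional as F

    def clip_style_loss(tile_emb: torch.Tensor,
                        text_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE loss over a batch of paired tile/report embeddings."""
        # L2-normalize so the dot product is a cosine similarity.
        tile_emb = F.normalize(tile_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # Pairwise similarity matrix: entry (i, j) compares tile i with caption j.
        logits = tile_emb @ text_emb.t() / temperature

        # Matching image-text pairs lie on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)

        # Average the image-to-text and text-to-image cross-entropy terms.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

    # Random embeddings stand in for outputs of a tile encoder and a text encoder.
    tiles = torch.randn(8, 512)
    texts = torch.randn(8, 512)
    print(clip_style_loss(tiles, texts))

LLM-based vision-language models and the vision-knowledge graph and vision-gene expression paradigms use different alignment and conditioning mechanisms, which the survey covers in detail.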
@article{li2025_2503.09091,
  title   = {Multi-Modal Foundation Models for Computational Pathology: A Survey},
  author  = {Dong Li and Guihong Wan and Xintao Wu and Xinyu Wu and Xiaohui Chen and Yi He and Christine G. Lian and Peter K. Sorger and Yevgeniy R. Semenov and Chen Zhao},
  journal = {arXiv preprint arXiv:2503.09091},
  year    = {2025}
}