A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI

Despite the excitement behind biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data (the foundation for modern AI systems) is still a bottleneck to unlocking its full potential. To address this gap, we introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To overcome the challenges of accessing our large-scale dataset, we provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems. We demonstrate the utility of the Biomedica dataset by building embedding models, chat-style models, and retrieval-augmented chat agents. Notably, all of our AI models surpass previous open systems in their respective categories, underscoring the critical role of diverse, high-quality, and large-scale biomedical data.
@article{lozano2025_2503.22727,
  title={A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI},
  author={Alejandro Lozano and Min Woo Sun and James Burgess and Jeffrey J. Nirschl and Christopher Polzak and Yuhui Zhang and Liangyu Chen and Jeffrey Gu and Ivan Lopez and Josiah Aklilu and Anita Rau and Austin Wolfgang Katzer and Collin Chiu and Orr Zohar and Xiaohan Wang and Alfred Seunghoon Song and Chiang Chia-Chun and Robert Tibshirani and Serena Yeung-Levy},
  journal={arXiv preprint arXiv:2503.22727},
  year={2025}
}