Aligning Proteins and Language: A Foundation Model for Protein Retrieval
- 3DV
This paper aims to retrieve proteins with similar structures and semantics from large-scale protein datasets, facilitating the functional interpretation of protein structures obtained by structure determination methods such as cryo-electron microscopy (cryo-EM). Motivated by recent progress in vision-language models (VLMs), we propose a CLIP-style framework that aligns 3D protein structures with functional annotations using contrastive learning. For model training, we introduce a large-scale dataset of approximately 200,000 protein-caption pairs with rich functional descriptors. We evaluate our model on in-domain retrieval using the Protein Data Bank (PDB) and on more challenging cross-database retrieval using the Electron Microscopy Data Bank (EMDB). In both cases, our approach demonstrates promising zero-shot retrieval performance, highlighting the potential of multimodal foundation models for structure-function understanding in protein biology.
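To make the "CLIP-style" alignment in the abstract concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE) objective and of similarity-based retrieval, written with NumPy only. It is not the authors' implementation: the abstract does not specify the encoders, embedding size, or temperature, so the function names (`clip_contrastive_loss`, `retrieve_top_k`), the precomputed embedding inputs, and the temperature of 0.07 are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(protein_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched pairs.

    Row i of protein_emb and row i of text_emb are assumed to come from the
    same protein-caption pair; every other row in the batch is a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    p = l2_normalize(protein_emb)
    t = l2_normalize(text_emb)
    logits = p @ t.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(p))
    # Protein -> caption: softmax over each row, target on the diagonal.
    loss_p2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Caption -> protein: softmax over each column, target on the diagonal.
    loss_t2p = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_p2t + loss_t2p)


def retrieve_top_k(query_emb, gallery_emb, k=5):
    """Return indices of the k gallery items most similar to each query."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    return np.argsort(-sims, axis=1)[:, :k]


# Toy data standing in for encoder outputs (hypothetical, random).
rng = np.random.default_rng(0)
protein_emb = rng.normal(size=(8, 16))
text_emb = protein_emb + 0.05 * rng.normal(size=(8, 16))

aligned_loss = clip_contrastive_loss(protein_emb, text_emb)
mismatched_loss = clip_contrastive_loss(protein_emb, text_emb[::-1])
top1 = retrieve_top_k(text_emb, protein_emb, k=1)[:, 0]
```

In this toy setup, the loss for correctly paired embeddings should be lower than for deliberately mismatched pairs, and a caption embedding used as a query should rank its own protein first. Zero-shot retrieval in this style needs no task-specific classifier: once both modalities share an embedding space, retrieval reduces to a nearest-neighbor search by cosine similarity.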
@article{wu2025_2506.08023,
  title={Aligning Proteins and Language: A Foundation Model for Protein Retrieval},
  author={Qifeng Wu and Zhengzhe Liu and Han Zhu and Yizhou Zhao and Daisuke Kihara and Min Xu},
  journal={arXiv preprint arXiv:2506.08023},
  year={2025}
}