
Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop

Elizabeth Fahsbender
Alma Andersson
Jeremy Ash
Polina Binder
Daniel Burkhardt
Benjamin Chang
Georg K. Gerber
Anthony Gitter
Patrick Godau
Ankit Gupta
Genevieve Haliburton
Siyu He
Trey Ideker
Ivana Jelic
Aly Khan
Yang-Joon Kim
Aditi Krishnapriyan
Jon M. Laurent
Tianyu Liu
Emma Lundberg
Shalin B. Mehta
Rob Moccia
Angela Oliveira Pisco
Katherine S. Pollard
Suresh Ramani
Julio Saez-Rodriguez
Yasin Senbabaoglu
Elana Simon
Srinivasan Sivanandan
Gustavo Stolovitzky
Marc Valer
Bo Wang
Xikun Zhang
James Zou
Katrina Kalantar
Main: 10 pages
Abstract

Artificial intelligence holds immense promise for transforming biology, yet a lack of standardized, cross-domain benchmarks undermines our ability to build robust, trustworthy models. Here, we present insights from a recent workshop that convened machine learning and computational biology experts across imaging, transcriptomics, proteomics, and genomics to tackle this gap. We identify major technical and systemic bottlenecks, including data heterogeneity and noise, reproducibility challenges, biases, and the fragmented ecosystem of publicly available resources, and we propose a set of recommendations for building benchmarking frameworks that can efficiently compare ML models of biological systems across tasks and data modalities. By promoting high-quality data curation, standardized tooling, comprehensive evaluation metrics, and open, collaborative platforms, we aim to accelerate the development of robust benchmarks for AI-driven Virtual Cells. These benchmarks are crucial for ensuring rigor, reproducibility, and biological relevance, and will ultimately advance the field toward integrated models that drive new discoveries, therapeutic insights, and a deeper understanding of cellular systems.
