
Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop

Elizabeth Fahsbender
Alma Andersson
Jeremy Ash
Polina Binder
Daniel Burkhardt
Benjamin Chang
Georg K. Gerber
Anthony Gitter
Patrick Godau
Ankit Gupta
Genevieve Haliburton
Siyu He
Trey Ideker
Ivana Jelic
Aly Khan
Yang-Joon Kim
Aditi Krishnapriyan
Jon M. Laurent
Tianyu Liu
Emma Lundberg
Shalin B. Mehta
Rob Moccia
Angela Oliveira Pisco
Katherine S. Pollard
Suresh Ramani
Julio Saez-Rodriguez
Yasin Senbabaoglu
Elana Simon
Srinivasan Sivanandan
Gustavo Stolovitzky
Marc Valer
Bo Wang
Xikun Zhang
James Zou
Katrina Kalantar
Main: 10 pages
Abstract

Artificial intelligence holds immense promise for transforming biology, yet a lack of standardized, cross-domain benchmarks undermines our ability to build robust, trustworthy models. Here, we present insights from a recent workshop that convened machine learning and computational biology experts across imaging, transcriptomics, proteomics, and genomics to tackle this gap. We identify major technical and systemic bottlenecks, including data heterogeneity and noise, reproducibility challenges, biases, and the fragmented ecosystem of publicly available resources, and we propose a set of recommendations for building benchmarking frameworks that can efficiently compare ML models of biological systems across tasks and data modalities. By promoting high-quality data curation, standardized tooling, comprehensive evaluation metrics, and open, collaborative platforms, we aim to accelerate the development of robust benchmarks for AI-driven Virtual Cells. These benchmarks are crucial for ensuring rigor, reproducibility, and biological relevance, and will ultimately advance the field toward integrated models that drive new discoveries, therapeutic insights, and a deeper understanding of cellular systems.
