Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

Designing experiments and interpreting results are core scientific competencies, particularly in biology, where researchers perturb complex systems to uncover their underlying mechanisms. Recent efforts to evaluate the scientific capabilities of large language models (LLMs) fail to test these competencies because wet-lab experimentation is prohibitively expensive in expertise, time, and equipment. We introduce SciGym, a first-in-class benchmark that assesses LLMs' iterative experiment design and analysis abilities in open-ended scientific discovery tasks. SciGym overcomes the cost of wet-lab experiments by running a dry lab of simulated biological systems. These systems, encoded in the Systems Biology Markup Language (SBML), are cheap to simulate, making them ideal testbeds for experimentation on realistically complex systems. We evaluated six frontier LLMs on 137 small systems and released a total of 350 systems. Our evaluation shows that while more capable models demonstrated superior performance, all models' performance declined significantly as system complexity increased, suggesting substantial room for improvement in the scientific capabilities of LLM agents.
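To make the dry-lab idea concrete, here is a minimal sketch of simulating an SBML-encoded system with the open-source libroadrunner package. The model file name, the perturbed species, and the time grid are illustrative assumptions; this is not the benchmark's actual harness, only one common way to generate simulated time-course data from an SBML model.

```python
# Minimal sketch: generating "dry lab" data from an SBML model.
# Assumes the libroadrunner package (pip install libroadrunner).
# "model.xml" and species "S1" are hypothetical placeholders.
import roadrunner

rr = roadrunner.RoadRunner("model.xml")  # load an SBML model from disk

# A hypothetical "experiment": perturb one species' initial concentration...
rr.model["[S1]"] = 2.0

# ...then simulate and record a time course for downstream analysis.
result = rr.simulate(0, 100, 501)  # start time, end time, number of points
print(result[:5])  # first rows: time plus species concentrations
```

In a setup like SciGym's, an agent would iterate this loop: choose a perturbation, receive the simulated data, and refine its hypothesis about the hidden system.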
@article{duan2025_2507.02083,
  title={Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab},
  author={Haonan Duan and Stephen Zhewen Lu and Caitlin Fiona Harrigan and Nishkrit Desai and Jiarui Lu and Michał Koziarski and Leonardo Cotta and Chris J. Maddison},
  journal={arXiv preprint arXiv:2507.02083},
  year={2025}
}