
How Evaluation Choices Distort the Outcome of Generative Drug Discovery

Journal of Cheminformatics (J Cheminform), 2024
Main: 11 pages, 17 figures, 3 tables; Bibliography: 5 pages; Appendix: 8 pages
Abstract

"How to evaluate the de novo designs proposed by a generative model?" Despite the transformative potential of generative deep learning in drug discovery, this seemingly simple question has no clear answer. The absence of standardized guidelines hampers both the benchmarking of generative approaches and the selection of molecules for prospective studies. In this work, we take a fresh perspective, both critical and constructive, on de novo design evaluation. By training chemical language models, we analyze approximately 1 billion molecule designs and discover principles that are consistent across different neural networks and datasets. We uncover a key confounder: the size of the generated molecular library significantly impacts evaluation outcomes, often leading to misleading model comparisons. We find that increasing the number of designs remedies this confounder, and we propose new, compute-efficient metrics that can be computed at large scale. We also identify critical pitfalls in commonly used metrics, such as uniqueness and distributional similarity, that can distort assessments of generative performance. To address these issues, we propose new and refined strategies for reliable model comparison and design evaluation. Furthermore, when examining molecule selection and sampling strategies, our findings reveal the constraints on diversifying the generated libraries and draw new parallels and distinctions between deep learning and drug discovery. We anticipate that our findings will help reshape evaluation pipelines in generative drug discovery, paving the way for more reliable and reproducible generative modeling approaches.
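The library-size confounder can be illustrated with a toy sketch. This is not the paper's experimental setup: integer IDs stand in for molecules, the "generator" is assumed to sample from a fixed Zipf-like distribution over 5,000 candidates, and the names `sample_library` and `uniqueness` are hypothetical. The only point is that the same fixed generator yields different uniqueness scores depending on how many designs are sampled.

```python
import random


def sample_library(n_designs, n_candidates=5000, seed=0):
    """Draw n_designs items from a fixed toy 'generator'.

    Assumption: the generator is a Zipf-like categorical distribution
    over integer IDs that stand in for molecules (not real SMILES).
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(n_candidates)]
    return rng.choices(range(n_candidates), weights=weights, k=n_designs)


def uniqueness(library):
    """Fraction of distinct designs in the library."""
    return len(set(library)) / len(library)


# Same generator, different library sizes.
sizes = [100, 1_000, 10_000]
scores = {n: uniqueness(sample_library(n)) for n in sizes}

for n, score in scores.items():
    print(f"{n:>6} designs: uniqueness = {score:.3f}")
```

In this toy setting uniqueness falls as the library grows, even though the generator never changes, so comparing two models at different library sizes would conflate sampling budget with model quality.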
