
Learning where to learn: Training data distribution optimization for scientific machine learning

Main: 10 pages · 20 figures · Bibliography: 6 pages · 3 tables · Appendix: 34 pages
Abstract

In scientific machine learning, models are routinely deployed with parameter values or boundary conditions far from those used in training. This paper studies the learning-where-to-learn problem of designing a training data distribution that minimizes average prediction error across a family of deployment regimes. A theoretical analysis shows how the training distribution shapes deployment accuracy. This motivates two adaptive algorithms based on bilevel or alternating optimization in the space of probability measures. Discretized implementations using parametric distribution classes or nonparametric particle-based gradient flows deliver optimized training distributions that outperform nonadaptive designs. Once trained, the resulting models exhibit improved sample complexity and robustness to distribution shift. This framework unlocks the potential of principled data acquisition for learning functions and solution operators of partial differential equations.
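To make the alternating scheme concrete, below is a minimal sketch in JAX of one plausible reading: a model is repeatedly fit on data drawn from a particle-based training distribution, and the particles then take a gradient step that concentrates them where the current model errs. The toy target, polynomial features, Gaussian deployment distributions, and the residual-ascent particle update are all illustrative assumptions standing in for the paper's measure-space gradient flow, not the authors' exact algorithm.

```python
import jax
import jax.numpy as jnp

# Hypothetical toy problem: approximate g(x) = sin(4x) on [-1, 1] with a
# polynomial-feature model; deployment regimes are Gaussian input
# distributions centered at each mu (all choices illustrative).
def g(x):
    return jnp.sin(4.0 * x)

def features(x):
    return jnp.stack([x**k for k in range(10)], axis=-1)

def fit(xs):
    # Inner problem: least-squares fit of the coefficients on the
    # current training particles.
    A = features(xs)
    theta, *_ = jnp.linalg.lstsq(A, g(xs))
    return theta

def deployment_error(theta, key, mus, n=256):
    # Average squared error over deployment inputs x ~ N(mu, 0.1^2).
    x = (mus[:, None] + 0.1 * jax.random.normal(key, (mus.shape[0], n))).ravel()
    return jnp.mean((features(x) @ theta - g(x)) ** 2)

def particle_objective(xs, theta):
    # Pointwise residual at the particle locations; ascending it moves
    # training data toward regions the current model fits poorly.
    return jnp.sum((features(xs) @ theta - g(xs)) ** 2)

key = jax.random.PRNGKey(0)
xs = jax.random.uniform(key, (32,), minval=-1.0, maxval=1.0)  # training particles
mus = jnp.linspace(-0.8, 0.8, 5)                              # deployment centers

for step in range(50):
    theta = fit(xs)                                  # (1) train on current distribution
    grad = jax.grad(particle_objective)(xs, theta)   # (2) particle gradient step
    xs = jnp.clip(xs + 0.02 * grad, -1.0, 1.0)       # keep particles in the domain

key, sub = jax.random.split(key)
print("avg deployment error:", deployment_error(fit(xs), sub, mus))
```

The alternation mirrors the abstract's description: the inner step solves the learning problem for a fixed training distribution, while the outer step adapts that distribution; a parametric variant would instead update, say, the mean and covariance of a sampling density rather than individual particles.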
