Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies

15 October 2024

Abstract

Recent progress in machine learning (ML) has made high-accuracy quantum chemistry (QC) calculations more accessible. Of particular interest are multifidelity machine learning (MFML) methods where training data from differing accuracies or fidelities are used. These methods usually employ a fixed scaling factor, $\gamma$, to relate the number of training samples across different fidelities, which reflects the cost and assumed sparsity of the data. This study investigates the impact of modifying $\gamma$ on model efficiency and accuracy for the prediction of vertical excitation energies using the QeMFi benchmark dataset. Further, this work introduces QC compute time informed scaling factors, denoted as $\theta$, that vary based on QC compute times at different fidelities. A novel error metric, error contours of MFML, is proposed to provide a comprehensive view of model error contributions from each fidelity. The results indicate that high model accuracy can be achieved with just 2 training samples at the target fidelity when a larger number of samples from lower fidelities are used. This is further illustrated through a novel concept, the $\Gamma$-curve, which compares model error against the time-cost of generating training samples, demonstrating that multifidelity models can achieve high accuracy while minimizing training data costs.
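To make the role of the scaling factor concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a geometric relation between fidelities, in which each step down in fidelity multiplies the number of training samples by $\gamma$; the abstract does not state the exact form, so this is an assumption. The function names, the choice of five fidelities, and the per-sample costs are all hypothetical and serve only to show how a time-cost of the kind plotted on the horizontal axis of a $\Gamma$-curve could be tallied.

```python
def samples_per_fidelity(n_top, gamma, n_fidelities):
    """Training-set sizes ordered from lowest to highest fidelity.

    Assumes (hypothetically) that each step down in fidelity multiplies
    the number of samples by gamma, starting from n_top at the target.
    """
    return [n_top * gamma ** (n_fidelities - 1 - f) for f in range(n_fidelities)]


def training_cost(sizes, cost_per_sample):
    """Total cost of generating the training data across all fidelities."""
    return sum(n * c for n, c in zip(sizes, cost_per_sample))


# Two samples at the target fidelity, gamma = 2, five fidelities (illustrative).
sizes = samples_per_fidelity(n_top=2, gamma=2, n_fidelities=5)

# Made-up per-sample compute times in arbitrary units, lowest to highest fidelity.
costs = [1.0, 2.0, 4.0, 8.0, 16.0]
total = training_cost(sizes, costs)

print(sizes)  # [32, 16, 8, 4, 2]
print(total)  # 160.0
```

Repeating the tally for several values of $\gamma$ and pairing each total with the corresponding model error would give the points of an error-versus-cost comparison in the spirit of the one described above.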


@article{vinod2025_2410.11392,
  title={Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies},
  author={Vivin Vinod and Peter Zaspel},
  journal={arXiv preprint arXiv:2410.11392},
  year={2025}
}
