Predict Training Data Quality via Its Geometry in Metric Space

12 October 2025

Yang Ba

Mohammad Sadeq Abolhasani

Rong Pan

Main: 6 pages · Bibliography: 2 pages · Appendix: 2 pages · 4 figures · 1 table

Abstract

High-quality training data is the foundation of machine learning and artificial intelligence, shaping how models learn and perform. Although much is known about what types of data are effective for training, the impact of the data's geometric structure on model performance remains largely underexplored. We propose that both the richness of representation and the elimination of redundancy within training data critically influence learning outcomes. To investigate this, we employ persistent homology to extract topological features from data within a metric space, thereby offering a principled way to quantify diversity beyond entropy-based measures. Our findings highlight persistent homology as a powerful tool for analyzing and enhancing the training data that drives AI systems.
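To make the idea of topological features in a metric space concrete, here is a minimal sketch of the simplest case: 0-dimensional persistent homology of a Vietoris-Rips filtration. It relies on the standard fact that the finite H0 death times equal the edge weights of a minimum spanning tree over the pairwise distances. This is not the authors' pipeline, which the abstract does not describe. The function names, the toy point sets, and the use of total persistence as a diversity proxy are all assumptions made for illustration.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix for a list of coordinate tuples."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def h0_death_times(dist):
    """Death times of the finite H0 classes of a Vietoris-Rips filtration.

    Every point is born at scale 0. A connected component dies when an edge
    first merges it into another one, so the death times are exactly the
    minimum-spanning-tree edge weights (Kruskal with union-find).
    Convention assumed here: an edge enters at scale equal to its length.
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Find the component representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(weight)
    return deaths

def total_persistence(deaths):
    """Sum of H0 lifetimes (birth is 0). Hypothetical diversity proxy."""
    return sum(deaths)

# Toy data (illustrative only): three duplicated points versus evenly spread points.
redundant = [(0, 0), (0, 0), (0, 0), (1, 0)]
spread = [(0, 0), (1, 0), (2, 0), (3, 0)]

redundant_score = total_persistence(h0_death_times(pairwise_distances(redundant)))
spread_score = total_persistence(h0_death_times(pairwise_distances(spread)))
print(redundant_score, spread_score)
```

In this toy setting, duplicated points merge at scale 0 and contribute nothing to the score, while well-separated points each contribute a positive lifetime, which is one way redundancy and diversity could show up in a persistence summary. Higher-dimensional features (loops, voids) require a full persistent homology library and are not covered by this sketch.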
