Data Selection for ERMs

Annual Conference on Computational Learning Theory (COLT), 2025

20 April 2025

Main: 29 pages, 2 figures. Bibliography: 3 pages.

Abstract

Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning rule $\mathcal{A}$ and a data selection budget $n$, how well can $\mathcal{A}$ perform when trained on at most $n$ data points selected from a population of $N$ points? We investigate when it is possible to select $n \ll N$ points and achieve performance comparable to training on the entire population.
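To make the question concrete, the following is a minimal sketch of the data selection problem on a toy instance. The choice of task (one-dimensional mean estimation under squared loss, where the ERM is the sample mean), the helper names `erm_mean`, `population_loss`, and `best_subset`, and the exhaustive search are all assumptions made for illustration; they are not taken from the paper, and brute-force search is only feasible for very small $N$.

```python
import itertools
import numpy as np

def erm_mean(points):
    """ERM for squared loss over constant predictors: the sample mean."""
    return float(np.mean(points))

def population_loss(theta, population):
    """Average squared loss of the constant predictor theta on all N points."""
    return float(np.mean((population - theta) ** 2))

def best_subset(population, n):
    """Exhaustively try every subset of at most n points (illustration only).

    Returns (loss, indices) for the subset whose ERM has the smallest
    loss when evaluated on the entire population.
    """
    best = None
    for k in range(1, n + 1):
        for idx in itertools.combinations(range(len(population)), k):
            theta = erm_mean(population[list(idx)])
            loss = population_loss(theta, population)
            if best is None or loss < best[0]:
                best = (loss, idx)
    return best

# Hypothetical population of N = 6 points and a budget of n = 2.
population = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 10.0])
full_loss = population_loss(erm_mean(population), population)
sel_loss, sel_idx = best_subset(population, n=2)

print("loss when training on all N points:", full_loss)
print("loss when training on selected points:", sel_loss, "indices:", sel_idx)
```

In this toy instance the gap `sel_loss - full_loss` measures how much is lost by training on the selected points instead of the whole population, which is the quantity the question above asks about for a general learning rule $\mathcal{A}$.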
