Coresets for Multiple $\ell_p$ Regression

A coreset of a dataset with $n$ examples and $d$ features is a weighted subset of examples that is sufficient for solving downstream data analytic tasks. Nearly optimal constructions of coresets for least squares and $\ell_p$ linear regression with a single response are known in prior work. However, for multiple $\ell_p$ regression where there can be $m$ responses, there are no known constructions with size sublinear in $m$. In this work, we construct coresets of size $\tilde O(\varepsilon^{-2}d)$ for $p<2$ and $\tilde O(\varepsilon^{-p}d^{p/2})$ for $p>2$ independently of $m$ (i.e., dimension-free) that approximate the multiple $\ell_p$ regression objective at every point in the domain up to $(1\pm\varepsilon)$ relative error. If we only need to preserve the minimizer subject to a subspace constraint, we improve these bounds by an $\varepsilon$ factor for all $p>1$. All of our bounds are nearly tight. We give two applications of our results. First, we settle the number of uniform samples needed to approximate $\ell_p$ Euclidean power means up to a $(1+\varepsilon)$ factor, showing that $\tilde\Theta(\varepsilon^{-2})$ samples for $p=1$, $\tilde\Theta(\varepsilon^{-1})$ samples for $1<p<2$, and $\tilde\Theta(\varepsilon^{1-p})$ samples for $p>2$ are tight, answering a question of Cohen-Addad, Saulpic, and Schwiegelshohn. Second, we show that for $1<p<2$, every matrix has a subset of $\tilde O(\varepsilon^{-1}k)$ rows which spans a $(1+\varepsilon)$-approximately optimal $k$-dimensional subspace for $\ell_p$ subspace approximation, which is also nearly optimal.
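To make the notion of a coreset concrete, the following is a minimal sketch for the least squares case only, assuming a design matrix `A` with n examples and d features and a response matrix `B` with m responses. It uses standard leverage score sampling on the concatenated matrix `[A B]`, which is a textbook baseline and not the construction of this work; in particular, its sample size depends on d + m rather than being dimension-free. All function names, data, and parameter values here are hypothetical choices made for illustration.

```python
import numpy as np


def leverage_scores(M):
    """Row leverage scores of M: squared row norms of an orthonormal basis of its column space."""
    Q, _ = np.linalg.qr(M)  # reduced QR; assumes M has full column rank
    return np.sum(Q * Q, axis=1)


def sample_coreset(A, B, size, rng):
    """Sample `size` row indices with probability proportional to the leverage scores of [A B].

    Returns the sampled indices and importance weights 1 / (size * prob), which make the
    weighted objective an unbiased estimate of the full objective.
    """
    scores = leverage_scores(np.hstack([A, B]))
    probs = scores / scores.sum()
    idx = rng.choice(A.shape[0], size=size, replace=True, p=probs)
    weights = 1.0 / (size * probs[idx])
    return idx, weights


def objective(A, B, X, weights=None):
    """Weighted multiple least squares objective: sum_i w_i * ||a_i^T X - b_i||_2^2."""
    residual_sq = np.sum((A @ X - B) ** 2, axis=1)
    if weights is None:
        return float(residual_sq.sum())
    return float(weights @ residual_sq)


# Synthetic data (illustrative sizes, not taken from the paper).
rng = np.random.default_rng(0)
n, d, m = 5000, 5, 3
A = rng.standard_normal((n, d))
X_true = rng.standard_normal((d, m))
B = A @ X_true + 0.5 * rng.standard_normal((n, m))

idx, w = sample_coreset(A, B, size=500, rng=rng)

# Compare the full and coreset objectives at an arbitrary point X, not only at the minimizer.
X = rng.standard_normal((d, m))
full = objective(A, B, X)
approx = objective(A[idx], B[idx], X, weights=w)
rel_err = abs(approx - full) / full
print(f"full={full:.1f} coreset={approx:.1f} relative error={rel_err:.3f}")
```

The weighted subset `(A[idx], B[idx], w)` plays the role of the coreset: a downstream solver can evaluate or minimize the weighted objective on 500 rows instead of all 5000. Sampling from `[A B]` rather than `A` alone is what ties each row's residual to its sampling probability at every `X`, at the cost of the dependence on m noted above.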