Coresets for Multiple $\ell_p$ Regression

Abstract

A coreset of a dataset with $n$ examples and $d$ features is a weighted subset of examples that is sufficient for solving downstream data analytic tasks. Nearly optimal constructions of coresets for least squares and $\ell_p$ linear regression with a single response are known in prior work. However, for multiple $\ell_p$ regression, where there can be $m$ responses, there are no known constructions with size sublinear in $m$. In this work, we construct coresets of size $\tilde O(\varepsilon^{-2}d)$ for $p<2$ and $\tilde O(\varepsilon^{-p}d^{p/2})$ for $p>2$, independently of $m$ (i.e., dimension-free), that approximate the multiple $\ell_p$ regression objective at every point in the domain up to $(1\pm\varepsilon)$ relative error. If we only need to preserve the minimizer subject to a subspace constraint, we improve these bounds by an $\varepsilon$ factor for all $p>1$. All of our bounds are nearly tight. We give two applications of our results. First, we settle the number of uniform samples needed to approximate $\ell_p$ Euclidean power means up to a $(1+\varepsilon)$ factor, showing that $\tilde\Theta(\varepsilon^{-2})$ samples for $p = 1$, $\tilde\Theta(\varepsilon^{-1})$ samples for $1 < p < 2$, and $\tilde\Theta(\varepsilon^{1-p})$ samples for $p>2$ are tight, answering a question of Cohen-Addad, Saulpic, and Schwiegelshohn. Second, we show that for $1<p<2$, every matrix has a subset of $\tilde O(\varepsilon^{-1}k)$ rows which spans a $(1+\varepsilon)$-approximately optimal $k$-dimensional subspace for $\ell_p$ subspace approximation, which is also nearly optimal.
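To make the coreset guarantee concrete, here is a minimal NumPy sketch of the multiple $\ell_p$ regression objective and a weighted row subset approximating it. This is a hypothetical illustration, not the paper's construction: it uses naive uniform sampling with inverse-probability weights (the paper obtains its dimension-free sizes via importance sampling), and all names and parameters below are our own.

```python
import numpy as np

def lp_objective(A, B, X, p):
    """Entrywise multiple l_p regression objective ||AX - B||_p^p."""
    return np.sum(np.abs(A @ X - B) ** p)

def uniform_coreset(n, size, rng):
    """Uniform row sample with weights n/size, so the weighted objective
    is an unbiased estimate of the full one. (Naive baseline; the paper's
    coresets use importance sampling to get size independent of m.)"""
    idx = rng.choice(n, size=size, replace=True)
    weights = np.full(size, n / size)
    return idx, weights

rng = np.random.default_rng(0)
n, d, m, p = 2000, 3, 4, 1.5          # n examples, d features, m responses
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, m))
X = rng.standard_normal((d, m))        # an arbitrary query point

idx, w = uniform_coreset(n, 400, rng)
full = lp_objective(A, B, X, p)
# Weighted coreset objective: sum_i w_i * ||A_i X - B_i||_p^p over sampled rows.
approx = np.sum(w[:, None] * np.abs(A[idx] @ X - B[idx]) ** p)
print(abs(approx - full) / full)       # small relative error on this benign data
```

A coreset in the paper's sense makes this relative error at most $\varepsilon$ simultaneously for *every* $X$, which uniform sampling does not guarantee for worst-case inputs.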
