
Sparse Linear Regression is Easy on Random Supports

Main: 24 pages
Bibliography: 3 pages
Abstract

Sparse linear regression is one of the most basic questions in machine learning and statistics. Here, we are given as input a design matrix $X \in \mathbb{R}^{N \times d}$ and measurements or labels $y \in \mathbb{R}^N$ where $y = Xw^* + \xi$, and $\xi$ is the noise in the measurements. Importantly, we have the additional constraint that the unknown signal vector $w^*$ is sparse: it has $k$ non-zero entries where $k$ is much smaller than the ambient dimension $d$. Our goal is to output a prediction vector $\widehat{w}$ that has small prediction error: $\frac{1}{N}\|Xw^* - X\widehat{w}\|_2^2$.

Information-theoretically, we know what is best possible in terms of measurements: under most natural noise distributions, we can get prediction error at most $\epsilon$ with roughly $N = O(k \log d/\epsilon)$ samples. Computationally, this currently requires $d^{\Omega(k)}$ run-time. Alternately, with $N = O(d)$ samples, polynomial-time algorithms are known. Thus, there is an exponential gap (in the dependence on $d$) between the two regimes, and we do not know whether it is possible to get $d^{o(k)}$ run-time with $o(d)$ samples.

We give the first generic positive result for worst-case design matrices $X$: for any $X$, we show that if the support of $w^*$ is chosen at random, we can get prediction error $\epsilon$ with $N = \mathrm{poly}(k, \log d, 1/\epsilon)$ samples and run-time $\mathrm{poly}(d, N)$. This run-time holds for any design matrix $X$ with condition number up to $2^{\mathrm{poly}(d)}$.

Previously, such results were known for worst-case $w^*$, but only for random design matrices from well-behaved families, matrices with a very low condition number ($\mathrm{poly}(\log d)$; e.g., as studied in compressed sensing), or matrices with special structural properties.
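To make the setup concrete, the following is a minimal NumPy sketch of the problem instance and the prediction-error metric defined above. The dimensions, the noise level, and the support-restricted least-squares step are all illustrative assumptions; in particular, the "estimator" below is an oracle that knows the true support, not the paper's algorithm.

```python
import numpy as np

# Minimal sketch of the sparse linear regression setup from the abstract.
# All parameter values are illustrative assumptions, not from the paper.
rng = np.random.default_rng(0)

N, d, k = 200, 1000, 5            # samples, ambient dimension, sparsity
X = rng.standard_normal((N, d))   # a (here random, for demo) design matrix

# Sparse signal w* on a uniformly random support of size k,
# mirroring the random-support assumption in the abstract.
support = rng.choice(d, size=k, replace=False)
w_star = np.zeros(d)
w_star[support] = rng.standard_normal(k)

sigma = 0.1                       # assumed noise level
y = X @ w_star + sigma * rng.standard_normal(N)

# Stand-in oracle estimator: least squares restricted to the TRUE support.
# This is NOT the paper's algorithm; it only illustrates the error metric.
w_hat = np.zeros(d)
w_hat[support], *_ = np.linalg.lstsq(X[:, support], y, rcond=None)

# Prediction error as defined above: (1/N) * ||X w* - X w_hat||_2^2
pred_err = np.linalg.norm(X @ (w_star - w_hat)) ** 2 / N
print(f"prediction error: {pred_err:.2e}")
```

Under the paper's guarantee, an efficient algorithm achieves small prediction error on instances of this form, with $N = \mathrm{poly}(k, \log d, 1/\epsilon)$ samples, for any fixed design $X$, provided the support of $w^*$ is random as sketched here.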
