Unbiased estimates for linear regression via volume sampling

Abstract

For a full-rank n×d matrix X with n ≥ d, consider the task of solving the linear least squares problem, where we try to predict a response value for each of the n rows of X. Assume that obtaining the responses is expensive and we can only afford to obtain the responses for a small subset of rows. We show that a good approximate solution to this least squares problem can be obtained from just d responses, where d is the dimension. Concretely, if the rows are in general position and if a subset of d rows is chosen proportional to the squared volume spanned by those rows, then the expected total square loss (on all n rows) of the least squares solution found for the subset is exactly d+1 times the minimum achievable total loss. We provide lower bounds showing that the factor of d+1 is optimal, and that any i.i.d. row sampling procedure requires Ω(d log d) responses to achieve a finite factor guarantee. Moreover, the least squares solution obtained for the volume-sampled subproblem is an unbiased estimator of the optimal solution based on all n responses. Our methods lead to general matrix expectation formulas for volume sampling which go beyond linear regression. In particular, we propose a matrix estimator for the pseudoinverse X⁺, computed from a small subset of rows of the matrix X. The estimator is unbiased and, surprisingly, its covariance also has a closed form: it equals a specific factor times X⁺(X⁺)ᵀ. We believe that these new formulas establish a fundamental connection between linear least squares and volume sampling. Our analysis for computing matrix expectations is based on reverse iterative volume sampling, a technique which also leads to a new algorithm for volume sampling that is faster than the state of the art by a factor of n².
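To make the main claims concrete, here is a minimal brute-force sketch in Python/NumPy that enumerates all size-d subsets to volume-sample exactly and numerically checks three of the statements above: the unbiasedness of the subsampled least squares solution, the exact d+1 loss factor, and the unbiasedness of the zero-padded pseudoinverse estimator. It is an illustration under the stated assumptions (rows in general position, subset size exactly d), not the paper's fast sampling algorithm; all variable names are our own.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.standard_normal((n, d))   # full-rank n x d design matrix
y = rng.standard_normal(n)        # responses for all n rows

w_star = np.linalg.pinv(X) @ y               # optimum using all n responses
loss_star = np.sum((X @ w_star - y) ** 2)    # minimum achievable total loss

# Volume sampling over all size-d subsets: P(S) proportional to det(X_S)^2.
# By the Cauchy-Binet formula, the normalizer is det(X^T X).
subsets = [list(S) for S in itertools.combinations(range(n), d)]
probs = np.array([np.linalg.det(X[S]) ** 2 for S in subsets])
probs /= np.linalg.det(X.T @ X)
assert np.isclose(probs.sum(), 1.0)

# Exact expectations under volume sampling (full enumeration, no Monte Carlo).
E_w = np.zeros(d)
E_loss = 0.0
E_pinv = np.zeros((d, n))
for S, p in zip(subsets, probs):
    w_S = np.linalg.solve(X[S], y[S])        # fit using only the d sampled responses
    E_w += p * w_S
    E_loss += p * np.sum((X @ w_S - y) ** 2) # loss evaluated on all n rows
    est = np.zeros((d, n))
    est[:, S] = np.linalg.inv(X[S])          # pseudoinverse estimator, zero-padded to n columns
    E_pinv += p * est

print(np.allclose(E_w, w_star))                # True:  E[w_S] = w*
print(E_loss / loss_star, "vs", d + 1)         # ratio equals d+1 up to float error
print(np.allclose(E_pinv, np.linalg.pinv(X)))  # True:  unbiased estimator of X^+
```

Note that full enumeration visits all C(n, d) subsets and is only feasible for tiny n; the reverse iterative volume sampling technique mentioned in the abstract is what makes sampling practical at scale.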
