
Computationally Feasible Near-Optimal Subset Selection for Linear Regression under Measurement Constraints

Abstract

Computationally feasible and statistically near-optimal subset selection strategies are derived to select a small portion of design (data) points in a linear regression model $y = X\beta + \varepsilon$, reducing measurement cost and improving data efficiency. We consider two subset selection algorithms for estimating the model coefficients $\beta$: the first is a random-subsampling-based method that achieves optimal statistical performance up to a small $(1+\epsilon)$ relative factor under the with-replacement model, and up to an $O(\log k)$ multiplicative factor under the without-replacement model, where $k$ denotes the measurement budget. The second algorithm is fully deterministic and achieves a $(1+\epsilon)$ relative approximation under the without-replacement model, at the cost of a slightly worse dependence of $k$ on the number of variables (the data dimension) in the linear regression model. Finally, we show how our method can be extended to the corresponding prediction problem, and we remark on interpretable sampling (selection) of data points under random design frameworks.
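To make the random-subsampling idea concrete, the sketch below implements leverage-score sampling with replacement for least squares, a standard instance of this family of methods. It is illustrative only: the paper's actual sampling distribution and reweighting may differ, and the function name and parameters here are assumptions, not the authors' API.

```python
import numpy as np

def leverage_score_subsample(X, y, k, seed=None):
    """Sample k rows of (X, y) with replacement, with probabilities
    proportional to statistical leverage scores, then solve a
    reweighted least-squares problem on the subsample.

    Illustrative sketch of subsampling-based subset selection;
    not the paper's exact algorithm.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Leverage score of row i = squared norm of the i-th row of U,
    # where X = U S V^T is a thin SVD; the scores sum to p.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    lev = np.sum(U ** 2, axis=1)
    probs = lev / lev.sum()
    idx = rng.choice(n, size=k, replace=True, p=probs)
    # Reweight each sampled row by 1/sqrt(k * p_i) so the subsampled
    # normal equations are unbiased estimates of the full-data ones.
    w = 1.0 / np.sqrt(k * probs[idx])
    X_sub = X[idx] * w[:, None]
    y_sub = y[idx] * w
    beta_hat, *_ = np.linalg.lstsq(X_sub, y_sub, rcond=None)
    return beta_hat, idx
```

Under this scheme only the k selected responses need to be measured, which is the point of the measurement-constrained setting: the design matrix X is known in advance, while each entry of y is costly to observe.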
