
Computationally Feasible Near-Optimal Subset Selection for Linear Regression under Measurement Constraints

Abstract

Computationally feasible and statistically near-optimal subset selection strategies are derived to select a small portion of design (data) points in a linear regression model $y = X\beta + \varepsilon$, reducing measurement cost and improving data efficiency. We consider two subset selection algorithms for estimating the model coefficients $\beta$: the first is a random-subsampling-based method that achieves optimal statistical performance up to a small $(1+\epsilon)$ relative factor under the with-replacement model, and up to an $O(\log k)$ multiplicative factor under the without-replacement model, where $k$ denotes the measurement budget. The second algorithm is fully deterministic and achieves a $(1+\epsilon)$ relative approximation under the without-replacement model, at the cost of a slightly worse dependence of $k$ on the number of variables (the data dimension) in the linear regression model. Finally, we show how our method can be extended to the corresponding prediction problem, and we remark on interpretable sampling (selection) of data points under random design frameworks.
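To make the random-subsampling idea concrete, the sketch below implements leverage-score sampling with replacement for least squares, a standard instance of this family of methods. It is illustrative only: the paper's actual sampling distribution and reweighting may differ, and the function name and parameters here are assumptions, not the authors' API.

```python
import numpy as np

def leverage_score_subsample(X, y, k, seed=None):
    """Sample k rows of (X, y) with replacement, with probabilities
    proportional to statistical leverage scores, then solve a
    reweighted least-squares problem on the subsample.

    Illustrative sketch of subsampling-based subset selection;
    not the paper's exact algorithm.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Leverage score of row i = squared norm of the i-th row of U,
    # where X = U S V^T is a thin SVD; the scores sum to p.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    lev = np.sum(U ** 2, axis=1)
    probs = lev / lev.sum()
    idx = rng.choice(n, size=k, replace=True, p=probs)
    # Reweight each sampled row by 1/sqrt(k * p_i) so the subsampled
    # normal equations are unbiased estimates of the full-data ones.
    w = 1.0 / np.sqrt(k * probs[idx])
    X_sub = X[idx] * w[:, None]
    y_sub = y[idx] * w
    beta_hat, *_ = np.linalg.lstsq(X_sub, y_sub, rcond=None)
    return beta_hat, idx
```

Under this scheme only the k selected responses need to be measured, which is the point of the measurement-constrained setting: the design matrix X is known in advance, while each entry of y is costly to observe.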
