Active Subsampling for Measurement-Constrained M-Estimation of Individualized Thresholds with High-Dimensional Data

Main:27 Pages
12 Figures
Bibliography:4 Pages
2 Tables
Appendix:46 Pages
Abstract

In measurement-constrained problems, large datasets are available, but we can afford to observe the labels for only a small portion of the data. This raises a critical question: which data points are most beneficial to label under a budget constraint? In this paper, we focus on estimating the optimal individualized threshold in a measurement-constrained M-estimation framework. Our goal is to estimate a high-dimensional parameter $\theta$ in a linear threshold $\theta^T Z$ for a continuous variable $X$ such that the discrepancy between whether $X$ exceeds the threshold $\theta^T Z$ and a binary outcome $Y$ is minimized. We propose a novel $K$-step active subsampling algorithm to estimate $\theta$, which iteratively samples the most informative observations and solves a regularized M-estimator. The theoretical properties of our estimator demonstrate a phase transition phenomenon with respect to $\beta\geq 1$, the smoothness of the conditional density of $X$ given $Y$ and $Z$. For $\beta>(1+\sqrt{3})/2$, we show that the two-step algorithm yields an estimator with the parametric convergence rate $O_p((s \log d /N)^{1/2})$ in $l_2$ norm. The rate of our estimator is strictly faster than the minimax optimal rate with $N$ i.i.d. samples drawn from the population. For the other two scenarios, $1<\beta\leq (1+\sqrt{3})/2$ and $\beta=1$, the estimator from the two-step algorithm is sub-optimal. The former requires running $K>2$ steps to attain the same parametric rate, whereas in the latter case only a near-parametric rate can be obtained. Furthermore, we formulate a minimax framework for the measurement-constrained M-estimation problem and prove that our estimator is minimax rate optimal up to a logarithmic factor. Finally, we demonstrate the performance of our method in simulation studies and apply the method to analyze a large diabetes dataset.
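
As a reading aid, the estimation target described above can be written down under one plausible formalization. The abstract does not state the loss, so the 0-1 misclassification form below, the coding $Y\in\{-1,+1\}$, and the symbol $\theta^*$ are assumptions made for illustration and may differ from the paper's exact definition (which could, for example, use a weighted or smoothed loss):

\[
\theta^* \in \arg\min_{\theta \in \mathbb{R}^d} \; \mathbb{E}\Big[ \mathbf{1}\big\{ Y \neq \operatorname{sign}\big(X - \theta^T Z\big) \big\} \Big],
\qquad Y \in \{-1,+1\}.
\]

Under this reading, $\theta^T Z$ acts as a covariate-dependent cutoff for $X$, and the sparsity level $s$ and dimension $d$ in the rate $O_p((s \log d / N)^{1/2})$ would refer to the number of nonzero entries and the length of $\theta^*$, respectively.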
