Active Subsampling for Measurement-Constrained M-Estimation of
Individualized Thresholds with High-Dimensional Data
In measurement-constrained problems, large datasets may be available, yet we can afford to observe the labels for only a small portion of the data. This raises a critical question: which data points are most beneficial to label under a budget constraint? In this paper, we focus on estimating the optimal individualized threshold in a measurement-constrained M-estimation framework. Our goal is to estimate a high-dimensional parameter in a linear threshold for a continuous variable such that the discrepancy between whether the variable exceeds the threshold and a binary outcome is minimized. We propose a novel multi-step active subsampling algorithm to estimate this parameter, which iteratively samples the most informative observations and solves a regularized M-estimator. The theoretical properties of our estimator demonstrate a phase transition phenomenon with respect to the smoothness of the conditional density of the continuous variable. In one smoothness regime, we show that the two-step algorithm yields an estimator with the parametric convergence rate. This rate is strictly faster than the minimax optimal rate attainable with i.i.d. samples drawn from the population. In the other two regimes, the estimator from the two-step algorithm is sub-optimal: in one of them, additional steps are required to attain the same parametric rate, whereas in the other only a near-parametric rate can be obtained. Furthermore, we formulate a minimax framework for the measurement-constrained M-estimation problem and prove that our estimator is minimax rate optimal up to a logarithmic factor. Finally, we demonstrate the performance of our method in simulation studies and apply it to the analysis of a large diabetes dataset.
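To make the estimation target concrete, the following is a minimal sketch of the empirical discrepancy between the threshold rule and the binary outcome, which is the quantity the abstract says is minimized. The names `theta`, `xs`, `zs`, and `ys`, the 0/1 encoding of the outcome, and the use of a plain misclassification rate are all assumptions made for illustration. The sketch does not implement the paper's regularized M-estimator or its subsampling scheme.

```python
from typing import Sequence


def exceeds_threshold(theta: Sequence[float], x: float, z: Sequence[float]) -> bool:
    """Return True when the continuous variable x exceeds the linear threshold.

    The threshold is the inner product of theta and the covariate vector z
    (hypothetical names; the abstract's own symbols were lost in extraction).
    """
    threshold = sum(t * zj for t, zj in zip(theta, z))
    return x > threshold


def threshold_discrepancy(
    theta: Sequence[float],
    xs: Sequence[float],
    zs: Sequence[Sequence[float]],
    ys: Sequence[int],
) -> float:
    """Fraction of labeled observations where the threshold rule disagrees with y.

    Assumes each outcome in ys is encoded as 0 or 1.
    """
    mismatches = sum(
        int(exceeds_threshold(theta, x, z)) != y
        for x, z, y in zip(xs, zs, ys)
    )
    return mismatches / len(ys)


# Toy data: the first covariate is an intercept term.
theta = [1.0, 0.5]
zs = [[1.0, 2.0], [1.0, 0.0], [1.0, 4.0]]  # thresholds: 2.0, 1.0, 3.0
xs = [2.5, 0.5, 3.5]                        # exceeds: yes, no, yes
ys = [1, 0, 0]                              # third label disagrees with the rule
print(threshold_discrepancy(theta, xs, zs, ys))
```

In the measurement-constrained setting, only the labeled subsample would enter such a computation, which is why the choice of which observations to label matters.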