Distribution-free tests for lossless feature selection in classification and regression

We study the problem of lossless feature selection for a $d$-dimensional feature vector $X$ and label $Y$, both for binary classification and for nonparametric regression. For an index set $S \subseteq \{1, \dots, d\}$, consider the selected $|S|$-dimensional feature subvector $X_S = (X_i,\, i \in S)$. If $R^*$ and $R^*(S)$ stand for the minimum risk based on $X$ and on $X_S$, respectively, then $X_S$ is called lossless if $R^* = R^*(S)$. For classification, the minimum risk is the Bayes error probability, while in regression it is the residual variance. We introduce nearest-neighbor based test statistics for testing the hypothesis that $X_S$ is lossless. With a suitably chosen threshold, the corresponding tests are proved to be consistent under conditions on the distribution of $(X, Y)$ that are significantly milder than in previous work. Moreover, our threshold is dimension-independent, in contrast to earlier methods, where for large $d$ the threshold becomes too large to be useful in practice.
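To illustrate the idea in the regression case, here is a minimal sketch, not the paper's actual procedure: the residual variance attainable from a feature set is estimated by the classical 1-nearest-neighbor statistic $\frac{1}{2n}\sum_i (Y_i - Y_{N(i)})^2$, and the statistic for a candidate index set $S$ is the gap between the estimate based on $X_S$ alone and the estimate based on the full vector $X$. Under losslessness this gap should be near zero; comparing it to a threshold gives a test. All names, the synthetic data, and the specific estimator are illustrative assumptions; the paper's statistics, bias corrections, and threshold choice differ.

```python
import numpy as np

def nn_residual_variance(X, Y):
    # 1-NN estimate of the residual variance E[(Y - E[Y|X])^2]:
    # (1/(2n)) * sum_i (Y_i - Y_{N(i)})^2, where N(i) is the nearest
    # neighbor of X_i among the other sample points.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    nn = D.argmin(axis=1)                # index of each point's 1-NN
    return np.mean((Y - Y[nn]) ** 2) / 2.0

def losslessness_statistic(X, Y, S):
    # Gap between the estimated minimum risk using only the coordinates
    # in S and the estimate using all coordinates; near zero if S is
    # lossless, clearly positive otherwise (illustrative statistic only).
    return nn_residual_variance(X[:, sorted(S)], Y) - nn_residual_variance(X, Y)

rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.uniform(size=(n, d))
Y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=n)   # Y depends on X_0 only

T_lossless = losslessness_statistic(X, Y, S={0})      # close to zero
T_lossy = losslessness_statistic(X, Y, S={1, 2})      # clearly positive
```

Here `S = {0}` is lossless by construction, so `T_lossless` is small, while dropping the informative coordinate makes `T_lossy` of the order of the explained variance of the signal.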