
Distribution-free tests for lossless feature selection in classification and regression

Abstract

We study the problem of lossless feature selection for a $d$-dimensional feature vector $X=(X^{(1)},\dots,X^{(d)})$ and label $Y$ for binary classification as well as nonparametric regression. For an index set $S\subset\{1,\dots,d\}$, consider the selected $|S|$-dimensional feature subvector $X_S=(X^{(i)},\, i\in S)$. If $L^*$ and $L^*(S)$ stand for the minimum risk based on $X$ and $X_S$, respectively, then $X_S$ is called lossless if $L^*=L^*(S)$. For classification, the minimum risk is the Bayes error probability, while in regression, the minimum risk is the residual variance. We introduce nearest-neighbor based test statistics to test the hypothesis that $X_S$ is lossless. With the threshold $a_n=\log n/\sqrt{n}$, the corresponding tests are proved to be consistent under conditions on the distribution of $(X,Y)$ that are significantly milder than in previous work. Moreover, our threshold is dimension-independent, in contrast to earlier methods, where for large $d$ the threshold becomes too large to be useful in practice.
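The abstract does not spell out the test statistic. As a rough illustration for the regression case only, the sketch below compares 1-NN residual-variance estimates on $X_S$ and on the full $X$ against the threshold $a_n=\log n/\sqrt{n}$; it relies on the classical identity that $\mathbb{E}[(Y-Y_{\mathrm{NN}})^2]\to 2L^*$ for the 1-NN label $Y_{\mathrm{NN}}$. The function names, the specific estimator, and the demo data are our assumptions, not the paper's construction.

```python
import numpy as np
from scipy.spatial import cKDTree


def nn_residual_variance(X, Y):
    """1-NN estimate of the residual variance L* = E[(Y - E[Y|X])^2].

    Uses E[(Y - Y_NN)^2] -> 2 L* as n -> infinity, where Y_NN is the
    label of the nearest neighbor of X (assumes no duplicate points,
    so each point's first neighbor is itself).
    """
    tree = cKDTree(X)
    _, idx = tree.query(X, k=2)          # column 0 is the point itself
    y_nn = Y[idx[:, 1]]                  # label of the nearest neighbor
    return 0.5 * np.mean((Y - y_nn) ** 2)


def losslessness_test(X, Y, S):
    """Test H0: the subvector X_S is lossless, i.e. L*(S) = L*.

    Rejects H0 when the estimated excess risk exceeds the
    dimension-independent threshold a_n = log(n) / sqrt(n).
    """
    n = len(Y)
    T_n = nn_residual_variance(X[:, S], Y) - nn_residual_variance(X, Y)
    a_n = np.log(n) / np.sqrt(n)
    return T_n > a_n                     # True: reject losslessness of X_S


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 2000, 5
    X = rng.standard_normal((n, d))
    Y = X[:, 0] + 0.1 * rng.standard_normal(n)   # only X^{(1)} is relevant
    print(losslessness_test(X, Y, [0]))          # expected False: lossless
    print(losslessness_test(X, Y, [1]))          # expected True: lossy
```

Note that this sketch ignores the bias of the 1-NN estimator and the consistency conditions the paper actually works under; the classification version would instead use a nearest-neighbor estimate of the Bayes error probability.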
