
P-values for classification

Abstract

Let $(X,Y)$ be a random variable consisting of an observed feature vector $X \in \mathcal{X}$ and an unobserved class label $Y \in \{1,2,\dots,L\}$ with unknown joint distribution. In addition, let $\mathcal{D}$ be a training data set consisting of $n$ completely observed independent copies of $(X,Y)$. Usual classification procedures provide point predictors (classifiers) $\widehat{Y}(X,\mathcal{D})$ of $Y$ or estimate the conditional distribution of $Y$ given $X$. In order to quantify the certainty of classifying $X$, we propose to construct, for each $\theta = 1,2,\dots,L$, a p-value $\pi_{\theta}(X,\mathcal{D})$ for the null hypothesis that $Y = \theta$, treating $Y$ temporarily as a fixed parameter. In other words, the point predictor $\widehat{Y}(X,\mathcal{D})$ is replaced with a prediction region for $Y$ with a certain confidence. We argue that (i) this approach is advantageous over traditional approaches and (ii) any reasonable classifier can be modified to yield nonparametric p-values. We discuss issues such as optimality, single-use and multiple-use validity, as well as computational and graphical aspects.
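One simple way to realize such nonparametric p-values, in the spirit of conformal prediction, is to rank a nonconformity score of the test point under each hypothesized label against the scores of the training points of that class. The sketch below is an illustrative assumption, not the paper's exact construction: it uses a nearest-neighbor distance as the score, and the helper names (`class_p_values`, `prediction_region`) are hypothetical.

```python
import numpy as np

def class_p_values(x, X_train, y_train, labels):
    """For each candidate label theta, compute a p-value for H0: Y = theta by
    ranking a nonconformity score of x among the class-theta training points.
    Score used here (an illustrative choice): distance to the nearest
    training point of the same class."""
    p = {}
    for theta in labels:
        same = X_train[y_train == theta]
        # Leave-one-out score for each class-theta training point:
        # distance to its nearest *other* same-class point.
        d = np.linalg.norm(same[:, None, :] - same[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        train_scores = d.min(axis=1)
        # Score of the test point under the hypothesis Y = theta.
        test_score = np.linalg.norm(same - x, axis=1).min()
        # p-value: proportion of scores at least as nonconforming as the
        # test score, with the usual +1 correction.
        p[theta] = (1 + np.sum(train_scores >= test_score)) / (len(same) + 1)
    return p

def prediction_region(p_values, alpha=0.05):
    """Retain every label whose null hypothesis is not rejected at level alpha."""
    return {theta for theta, pv in p_values.items() if pv > alpha}
```

A label far from the test point receives a small p-value and drops out of the region; when the classes overlap near $x$, several labels survive, which is exactly the uncertainty the region is meant to express.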
