Rescaling and other forms of unsupervised preprocessing introduce bias into cross-validation

Cross-validation is the de facto standard for model evaluation and selection. Used properly, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo various forms of preprocessing, such as mean-centering, rescaling, dimensionality reduction, and outlier removal, prior to cross-validation. It is widely believed that such preprocessing stages, if done in an unsupervised manner that does not involve the class labels or response values, have no effect on the validity of cross-validation. In this paper, we show that this belief is not true. Preliminary unsupervised preprocessing can introduce either a positive or negative bias into the estimates of model performance. Thus, it may lead to invalid inference and sub-optimal choices of model parameters. In light of this, the scientific community should re-examine the use of preprocessing prior to cross-validation across the various application domains. By default, the parameters of all data-dependent transformations should be learned only from the training samples.
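The recommendation in the last sentence can be sketched in code. The snippet below (not from the paper; it assumes scikit-learn and a synthetic dataset) contrasts the biased pattern, where a scaler is fit on the full dataset before cross-validation, with the recommended pattern, where the scaler is refit on each training fold via a pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Biased pattern: the scaler's mean and variance are estimated from ALL
# samples, including those that will later serve as held-out test folds.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_scaled, y, cv=5
)

# Recommended pattern: the pipeline refits StandardScaler on each
# training fold only, so the test fold never influences the scaling.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipe, X, y, cv=5)

print("full-data scaling:", leaky_scores.mean())
print("per-fold scaling: ", clean_scores.mean())
```

On any given dataset the two estimates may differ in either direction, which is the paper's point: the bias from full-data preprocessing can be positive or negative.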