Optimal cross-validation in density estimation

5 November 2008

Abstract

The performance of cross-validation (CV) is analyzed in two contexts: (i) risk estimation and (ii) model selection in the density estimation framework. The main focus is on one CV algorithm called leave-$p$-out (Lpo), where $p$ denotes the cardinality of the test set. Closed-form expressions are derived for the Lpo estimator of the risk of projection estimators, which makes V-fold cross-validation completely useless. From a theoretical point of view, these closed-form expressions make it possible to study the performance of Lpo in terms of risk estimation. For instance, the optimality of leave-one-out (Loo), that is, Lpo with $p=1$, is proved among CV procedures. Two model selection frameworks are also considered: estimation and identification. Unlike in risk estimation, Loo is proved to be suboptimal as a model selection procedure. In the estimation framework with finite sample size $n$, optimality is achieved for $p$ large enough (with $p/n = o(1)$) to balance overfitting. A link is also identified between the optimal $p$ and the structure of the model collection. These theoretical results are strongly supported by simulation experiments. When performing identification, model consistency is also proved for Lpo with $p/n \to 1$ as $n \to +\infty$.
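To illustrate what a closed-form Lpo risk estimator looks like, here is a minimal sketch for one simple projection estimator: a regular histogram on $[0,1]$ with bin width $h$, under the quadratic ($L^2$) contrast. This setting, the function names, and the formula in `lpo_closed_form` are assumptions made for illustration; the formula was obtained by averaging the hold-out criterion over all splits for this special case and is not quoted from the paper. The brute-force version enumerates every test set of size $p$ and is only feasible for very small $n$.

```python
from itertools import combinations

import numpy as np


def bin_index(x, n_bins):
    """Bin index of each point for a regular partition of [0, 1]."""
    x = np.asarray(x, dtype=float)
    return np.minimum((x * n_bins).astype(int), n_bins - 1)


def lpo_brute_force(x, n_bins, p):
    """Average the hold-out L2 criterion over all test sets of size p."""
    idx = bin_index(x, n_bins)
    n = idx.size
    h = 1.0 / n_bins
    total, n_splits = 0.0, 0
    for test in combinations(range(n), p):
        test = np.array(test)
        train_mask = np.ones(n, dtype=bool)
        train_mask[test] = False
        # Histogram fitted on the n - p training points.
        counts = np.bincount(idx[train_mask], minlength=n_bins)
        heights = counts / ((n - p) * h)
        norm_sq = h * np.sum(heights ** 2)   # squared L2 norm of the estimator
        cross = np.mean(heights[idx[test]])  # mean of the estimator on the test points
        total += norm_sq - 2.0 * cross
        n_splits += 1
    return total / n_splits


def lpo_closed_form(x, n_bins, p):
    """Same quantity from bin counts only (assumed regular-histogram case)."""
    idx = bin_index(x, n_bins)
    n = idx.size
    h = 1.0 / n_bins
    counts = np.bincount(idx, minlength=n_bins)
    c = (n - 1) * (n - p) * h
    return (2 * n - p) / c - (n - p + 1) * np.sum(counts ** 2) / (n * c)
```

Under these assumptions the closed form costs one pass over the bin counts, whereas the brute-force loop visits $\binom{n}{p}$ splits. Both functions estimate the risk only up to the additive constant $\|s\|^2$, which does not depend on the model, so in a model selection setting one would minimize the returned value over `n_bins` for a given `p`.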
