
Efficient multivariate entropy estimation via $k$-nearest neighbour distances

Abstract

Many statistical procedures, including goodness-of-fit tests and methods for independent component analysis, rely critically on the estimation of the entropy of a distribution. In this paper we study a generalisation of the entropy estimator originally proposed by \citet{Kozachenko:87}, based on the $k$-nearest neighbour distances of a sample of $n$ independent and identically distributed random vectors in $\mathbb{R}^d$. When $d \geq 3$, and under regularity conditions, we derive the leading term in the asymptotic expansion of the bias of the estimator, while when $d \leq 2$ we provide bounds on the bias (which is typically negligible in these instances). We also prove that, when $d \leq 3$ and provided $k/\log^5 n \rightarrow \infty$, the estimator is efficient, in that it achieves the local asymptotic minimax lower bound; on the other hand, when $d \geq 4$, a non-trivial bias precludes its efficiency regardless of the choice of $k$. In addition to the theoretical understanding provided, our results also have several methodological implications; in particular, they motivate the prewhitening of the data before applying the estimator, facilitate the construction of asymptotically valid confidence intervals of asymptotically minimal width, and suggest methods for bias reduction to obtain root-$n$ consistency in higher dimensions.
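
For orientation, a $k$-nearest neighbour entropy estimator of the kind described above can be sketched in a few lines. The sketch assumes the commonly used form $\hat{H}_n = \frac{1}{n}\sum_{i=1}^n \log\bigl(\rho_{(k),i}^d V_d (n-1) / e^{\psi(k)}\bigr)$, where $\rho_{(k),i}$ is the Euclidean distance from the $i$th point to its $k$th nearest neighbour, $V_d = \pi^{d/2}/\Gamma(1+d/2)$ is the volume of the unit ball and $\psi$ is the digamma function; this formula is not stated in the abstract and is taken here as an assumption. The names `kl_entropy` and `digamma_int` are hypothetical, the neighbour search is brute force ($O(n^2)$) for clarity rather than speed, and the sample points are assumed distinct so that no distance is zero.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def digamma_int(k):
    """Digamma function at a positive integer k: -gamma + sum_{j=1}^{k-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, k))


def kl_entropy(points, k=1):
    """k-NN entropy estimate (in nats) for a list of equal-length tuples.

    Assumed form: mean over i of log(rho_i^d * V_d * (n - 1)) - psi(k),
    where rho_i is the distance from point i to its k-th nearest neighbour.
    """
    n = len(points)
    d = len(points[0])
    if not 1 <= k <= n - 1:
        raise ValueError("k must satisfy 1 <= k <= n - 1")
    # log of the unit-ball volume V_d = pi^(d/2) / Gamma(1 + d/2)
    log_vd = (d / 2) * math.log(math.pi) - math.lgamma(1 + d / 2)
    total_log_rho = 0.0
    for i, x in enumerate(points):
        # Brute-force distances from x to every other point, sorted ascending.
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        total_log_rho += math.log(dists[k - 1])  # fails if the distance is 0
    return d * total_log_rho / n + log_vd + math.log(n - 1) - digamma_int(k)


# Illustrative check on simulated data: bivariate standard normal sample,
# whose true entropy is log(2 * pi * e).
rng = random.Random(0)
sample = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(400)]
estimate = kl_entropy(sample, k=5)
truth = math.log(2 * math.pi * math.e)
```

Comparing `estimate` with `truth` gives a quick sanity check in low dimensions. For larger samples, a tree-based neighbour search would replace the inner loop; the choice of `k` and any bias correction in higher dimensions are the subject of the paper and are not addressed by this sketch.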
