This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating $P \ast \mathcal{N}_\sigma$, for $\mathcal{N}_\sigma \triangleq \mathcal{N}(0, \sigma^2 \mathrm{I}_d)$, by $\hat{P}_n \ast \mathcal{N}_\sigma$ under different statistical distances, where $\hat{P}_n$ is the empirical measure of $n$ i.i.d. samples from $P$. The convergence is examined in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and $\chi^2$-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance ($\mathsf{W}_1$) converges at the rate $e^{O(d)} n^{-\frac{1}{2}}$, in remarkable contrast to a typical $n^{-\frac{1}{d}}$ rate for unsmoothed $\mathsf{W}_1$ (and $d \geq 3$). For the KL divergence, squared 2-Wasserstein distance ($\mathsf{W}_2^2$), and $\chi^2$-divergence, the convergence rate is $e^{O(d)} n^{-1}$, but only if $P$ achieves finite input-output $\chi^2$ mutual information across the additive white Gaussian noise channel. If the latter condition is not met, the rate changes to $\omega(n^{-1})$ for the KL divergence and $\mathsf{W}_2^2$, while the $\chi^2$-divergence becomes infinite - a curious dichotomy. As a main application we consider estimating the differential entropy $h(P \ast \mathcal{N}_\sigma)$ in the high-dimensional regime. The distribution $P$ is unknown, but $n$ i.i.d. samples from it are available. We first show that any good estimator of $h(P \ast \mathcal{N}_\sigma)$ must have sample complexity that is exponential in $d$. Using the empirical approximation results, we then show that the absolute-error risk of the plug-in estimator converges at the parametric rate $e^{O(d)} n^{-\frac{1}{2}}$, thus establishing the minimax rate-optimality of the plug-in. Numerical results that demonstrate a significant empirical superiority of the plug-in approach over general-purpose differential entropy estimators are provided.
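
To make the plug-in approach concrete, here is a minimal sketch (not the authors' implementation) of estimating $h(\hat{P}_n \ast \mathcal{N}_\sigma)$ by Monte Carlo integration over the induced Gaussian mixture; the function name `plugin_entropy` and the parameters `sigma` and `num_mc` are illustrative choices, not taken from the paper.

```python
# A minimal sketch, assuming the plug-in estimator is computed by Monte Carlo
# integration over the Gaussian mixture \hat{P}_n * N(0, sigma^2 I_d).
# Names `plugin_entropy`, `sigma`, and `num_mc` are illustrative, not from the paper.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import logsumexp


def plugin_entropy(samples, sigma, num_mc=10_000, rng=None):
    """Monte Carlo estimate (in nats) of h(\\hat{P}_n * N_sigma), where `samples`
    is an (n, d) array of i.i.d. draws from the unknown distribution P."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = samples.shape

    # Draw Y = X_I + Z, with I uniform over the n sample points and
    # Z ~ N(0, sigma^2 I_d); then Y is distributed as \hat{P}_n * N_sigma.
    idx = rng.integers(n, size=num_mc)
    y = samples[idx] + sigma * rng.standard_normal((num_mc, d))

    # Log-density of the n-component Gaussian mixture evaluated at each draw.
    sq_dists = cdist(y, samples, metric="sqeuclidean")          # (num_mc, n)
    log_norm = -0.5 * d * np.log(2 * np.pi * sigma**2) - np.log(n)
    log_density = logsumexp(-sq_dists / (2 * sigma**2), axis=1) + log_norm

    # Differential entropy: h(\hat{P}_n * N_sigma) = -E[log density(Y)].
    return float(-np.mean(log_density))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((500, 3))        # toy i.i.d. samples (stand-in for P)
    print(plugin_entropy(x, sigma=1.0, rng=rng))
```

Since $Y = X_I + Z$ with $I$ uniform over the sample points is an exact draw from $\hat{P}_n \ast \mathcal{N}_\sigma$, averaging $-\log$ of the mixture density over these draws gives an unbiased Monte Carlo estimate of the plug-in entropy; only the Monte Carlo sample size controls the additional error on top of the $e^{O(d)} n^{-\frac{1}{2}}$ statistical rate.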