On the Statistical Complexity of Estimating Vendi Scores from Empirical Data

Conference on Uncertainty in Artificial Intelligence (UAI), 2024
Main: 8 pages; Bibliography: 4 pages; Appendix: 16 pages; 17 figures; 3 tables

Abstract

Evaluating the diversity of generative models without access to reference data poses methodological challenges. The reference-free Vendi score offers a solution by quantifying the diversity of generated data using matrix-based entropy measures. The Vendi score is usually computed via the eigendecomposition of an $n \times n$ kernel matrix for $n$ generated samples. However, the heavy computational cost of eigendecomposition for large $n$ often limits the sample size used in practice to a few tens of thousands. In this paper, we investigate the statistical convergence of the Vendi score. We numerically demonstrate that for kernel functions with an infinite feature map dimension, the score estimated from a limited sample size may exhibit a non-negligible bias relative to the population Vendi score, i.e., the asymptotic limit as the sample size approaches infinity. To address this, we introduce a truncation of the Vendi statistic, called the $t$-truncated Vendi statistic, which is guaranteed to converge to its asymptotic limit given $n = O(t)$ samples. We show that the existing Nyström method and the FKEA approximation method for approximating the Vendi score both converge to the population truncated Vendi score. We perform several numerical experiments to illustrate the concentration of the Nyström and FKEA-computed Vendi scores around the truncated Vendi and discuss how the truncated Vendi score correlates with the diversity of image and text data.
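
To make the eigendecomposition-based computation mentioned in the abstract concrete, here is a minimal sketch. It assumes the common order-1 (Shannon) form of the Vendi score, namely the exponential of the entropy of the eigenvalues of $K/n$ for a kernel matrix $K$ with unit diagonal. The function names `gaussian_kernel` and `vendi_score`, the Gaussian kernel choice, and the bandwidth `sigma` are illustrative assumptions, not taken from the paper, and the sketch does not implement the truncated statistic, the Nyström method, or FKEA.

```python
import numpy as np


def gaussian_kernel(x, sigma=1.0):
    """Gaussian (RBF) kernel matrix for samples in the rows of x (assumed kernel choice)."""
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def vendi_score(kernel):
    """Order-1 Vendi score: exp of the Shannon entropy of the eigenvalues of K/n.

    Assumes `kernel` is symmetric positive semidefinite with unit diagonal,
    so that the eigenvalues of K/n are nonnegative and sum to one.
    """
    n = kernel.shape[0]
    eigvals = np.linalg.eigvalsh(kernel / n)  # full eigendecomposition, O(n^3) cost
    eigvals = eigvals[eigvals > 1e-12]        # drop numerical zeros; 0 * log 0 is taken as 0
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(200, 8))       # hypothetical "generated" samples
    print(vendi_score(gaussian_kernel(samples, sigma=2.0)))
```

Under these assumptions the score ranges from 1 (all samples identical, so $K$ is the all-ones matrix) to $n$ (all samples mutually dissimilar, so $K$ is the identity). The upper limit of $n$ is one way to see why an estimate from a limited sample can depend on the sample size.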
