On the Statistical Complexity of Estimating Vendi Scores from Empirical Data

17 February 2025

Abstract

Evaluating the diversity of generative models without access to reference data poses methodological challenges. The reference-free Vendi score offers a solution by quantifying the diversity of generated data using matrix-based entropy measures. The Vendi score is usually computed via the eigendecomposition of an $n \times n$ kernel matrix for $n$ generated samples. However, the heavy computational cost of eigendecomposition for large $n$ often limits the sample size used in practice to a few tens of thousands. In this paper, we investigate the statistical convergence of the Vendi score. We numerically demonstrate that for kernel functions with an infinite feature map dimension, the score estimated from a limited sample size may exhibit a non-negligible bias relative to the population Vendi score, i.e., the asymptotic limit as the sample size approaches infinity. To address this, we introduce a truncation of the Vendi statistic, called the $t$ -truncated Vendi statistic, which is guaranteed to converge to its asymptotic limit given $n=O(t)$ samples. We show that the existing Nyström method and the FKEA approximation method for approximating the Vendi score both converge to the population truncated Vendi score. We perform several numerical experiments to illustrate the concentration of the Nyström and FKEA-computed Vendi scores around the truncated Vendi and discuss how the truncated Vendi score correlates with the diversity of image and text data.

View on arXiv

@article{ospanov2025_2410.21719,
  title={ On the Statistical Complexity of Estimating Vendi Scores from Empirical Data },
  author={ Azim Ospanov and Farzan Farnia },
  journal={arXiv preprint arXiv:2410.21719},
  year={ 2025 }
}

Comments on this paper