Chebyshev polynomials, moment matching, and optimal estimation of the unseen

Pengkun Yang
Abstract

We consider the problem of estimating the support size of a discrete distribution whose minimum non-zero mass is at least $\frac{1}{k}$. Under the independent sampling model, we show that the minimax sample complexity to achieve an additive error of $\epsilon k$ with probability at least 0.5 is within universal constant factors of $\frac{k}{\log k}\log^2\frac{1}{\epsilon}$, which improves the state-of-the-art result $\frac{k}{\epsilon^2 \log k}$ due to Valiant and Valiant. The optimal procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties. We also study the closely related species problem where the goal is to estimate the number of distinct colors in an urn containing $k$ balls from repeated draws. While achieving an additive error proportional to $k$ still requires $\Omega(\frac{k}{\log k})$ samples, we show that with $\Theta(k)$ samples one can strictly outperform a general support size estimator using interpolating polynomials.
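
To get a feel for the size of the improvement, the two sample-complexity expressions quoted above can be compared numerically. The following is only an illustrative sketch: the function names are hypothetical, the universal constants hidden in both bounds are set to 1 (an assumption, so only the scaling is meaningful), and the comparison is intended for small $\epsilon$.

```python
import math


def new_rate(k: float, eps: float) -> float:
    """(k / log k) * log^2(1 / eps), with the unknown constant taken as 1."""
    return k / math.log(k) * math.log(1.0 / eps) ** 2


def previous_rate(k: float, eps: float) -> float:
    """k / (eps^2 * log k), with the unknown constant taken as 1."""
    return k / (eps ** 2 * math.log(k))


def improvement_factor(eps: float) -> float:
    """Ratio previous_rate / new_rate = 1 / (eps * log(1 / eps))^2.

    The dependence on k cancels, so the gain is driven by eps alone.
    """
    return 1.0 / (eps * math.log(1.0 / eps)) ** 2


if __name__ == "__main__":
    k = 10 ** 6  # hypothetical support-size bound, for illustration only
    for eps in (0.1, 0.01, 0.001):
        ratio = previous_rate(k, eps) / new_rate(k, eps)
        print(f"eps={eps}: previous/new = {ratio:.1f}")
```

Up to constants, the ratio of the two expressions is $\frac{1}{\epsilon^2 \log^2(1/\epsilon)}$, which does not depend on $k$ and grows as $\epsilon$ shrinks.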
