
Chebyshev polynomials, moment matching, and optimal estimation of the unseen

Pengkun Yang
Abstract

We consider the problem of estimating the support size of a discrete distribution whose minimum non-zero mass is at least $\frac{1}{k}$. Under the independent sampling model, we show that the sample complexity, i.e., the minimal sample size to achieve an additive error of $\epsilon k$ with probability at least 0.1, is within universal constant factors of $\frac{k}{\log k}\log^2\frac{1}{\epsilon}$, which improves the state-of-the-art result of $\frac{k}{\epsilon^2 \log k}$ in \cite{VV13}. A similar characterization of the minimax risk is also obtained. Our procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties, which can be evaluated in $O(n+\log^2 k)$ time and attains the sample complexity within a factor of six asymptotically. The superiority of the proposed estimator in terms of accuracy, computational efficiency and scalability is demonstrated on a variety of synthetic and real datasets.
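To make the size of the improvement concrete, the two sample-complexity rates stated above can be compared numerically. The following is only a sketch: the function names are hypothetical, the universal constant factors are dropped, and natural logarithms are assumed (a different base changes the values only by a constant). Dividing the earlier rate by the new one gives $\frac{1}{\epsilon^2\log^2(1/\epsilon)}$, which does not depend on $k$.

```python
import math


def _check(k: float, eps: float) -> None:
    # Both rates are only meaningful for k > 1 and 0 < eps < 1.
    if k <= 1:
        raise ValueError("k must exceed 1 so that log k is positive")
    if not 0 < eps < 1:
        raise ValueError("eps must lie strictly between 0 and 1")


def rate_new(k: float, eps: float) -> float:
    """Rate (k / log k) * log^2(1 / eps), constants omitted."""
    _check(k, eps)
    return k / math.log(k) * math.log(1 / eps) ** 2


def rate_prev(k: float, eps: float) -> float:
    """Earlier rate k / (eps^2 * log k), constants omitted."""
    _check(k, eps)
    return k / (eps ** 2 * math.log(k))


def improvement(eps: float) -> float:
    """Ratio rate_prev / rate_new = 1 / (eps * log(1 / eps))^2."""
    _check(2.0, eps)  # k is irrelevant to the ratio; 2.0 is a placeholder
    return 1 / (eps * math.log(1 / eps)) ** 2


if __name__ == "__main__":
    k = 10 ** 6
    for eps in (0.2, 0.1, 0.01):
        print(f"eps={eps}: new={rate_new(k, eps):.3g}, "
              f"prev={rate_prev(k, eps):.3g}, "
              f"ratio={improvement(eps):.3g}")
```

Because the constants are omitted, these numbers indicate how the rates scale with $\epsilon$ and $k$, not actual sample sizes.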
