
Estimating the number of unseen species: How far can one foresee?

Abstract

Population estimation is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher, uses $n$ samples to estimate $U_n(m)$, the number of hitherto unseen elements that will be observed among $m$ new samples. A clear benchmark question is for how large an $m$ can $U_n(m)$ be estimated well. In seminal works, Good and Toulmin constructed an intriguing estimator that approximates $U_n(m)$ for all $m\le n$, and Efron and Thisted showed empirically that a variation of this estimator approximates $U_n(m)$ even for some $m>n$; however, no theoretical guarantees are known. We show that a simple modification of the estimator can accurately predict $U_n(m)$ with a normalized mean-squared error $\delta$ for $m= \frac{n\log n}{\log(3/\delta)}$. The bound applies to any $n$ and $m\ge 7$ without any hidden constants, and the algorithm is a simple linear estimator, making it particularly suitable for applications. We also show that no estimator can approximate $U_n(m)$ for $m$ beyond $\mathcal{O}(n\log n)$.
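
For orientation, the classical Good-Toulmin estimator mentioned above can be sketched in a few lines. This is the textbook form, $-\sum_{i\ge 1}(-t)^i\,\Phi_i$ with $t = m/n$ and $\Phi_i$ the number of species seen exactly $i$ times; it is not the modified estimator the abstract refers to, whose details the abstract does not give. The function names (`prevalences`, `good_toulmin`, `horizon`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def prevalences(samples):
    """Map i -> number of distinct species observed exactly i times."""
    species_counts = Counter(samples)
    return Counter(species_counts.values())


def good_toulmin(samples, m):
    """Classical Good-Toulmin estimate of U_n(m) (textbook form, assumed here).

    Computes -sum_i (-t)^i * Phi_i with t = m / n. The alternating series
    is only well behaved for t <= 1; for t > 1 its terms grow geometrically.
    """
    n = len(samples)
    t = m / n
    phi = prevalences(samples)
    return -sum((-t) ** i * count for i, count in phi.items())


def horizon(n, delta):
    """Extrapolation range m = n log n / log(3 / delta) quoted in the abstract.

    The ratio of logarithms does not depend on the base of the logarithm.
    """
    return n * math.log(n) / math.log(3 / delta)


samples = ["a", "a", "b", "c", "c", "c", "d"]
estimate = good_toulmin(samples, m=7)   # t = 1: two singletons dominate
reach = horizon(1000, 0.1)              # roughly twice n for these inputs
```

With $t \le 1$ the series is dominated by the number of singletons, which is why the classical estimator is only trusted for $m \le n$; extending the range is what requires damping the higher-order terms.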
