Estimating the number of unseen species: How far can one foresee?

23 November 2015

Abstract

Population estimation is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher, uses $n$ samples to estimate $U_n(m)$, the number of hitherto unseen elements that will be observed among $m$ new samples. A clear benchmark question is for how large an $m$ can $U_n(m)$ be estimated well. In seminal works, Good and Toulmin constructed an intriguing estimator that approximates $U_n(m)$ for all $m \le n$, and Efron and Thisted showed empirically that a variation of this estimator approximates $U_n(m)$ even for some $m > n$; however, no theoretical guarantees are known. We show that a simple modification of the estimator can accurately predict $U_n(m)$ with a normalized mean-squared error $\delta$ for $m = \frac{n\log n}{\log(3/\delta)}$. The bound applies to any $n$ and $m \ge 7$ without any hidden constants, and the algorithm is a simple linear estimator, making it particularly suitable for applications. We also show that no estimator can approximate $U_n(m)$ for $m$ beyond $\mathcal{O}(n\log n)$.
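
As an illustration of the quantities named in the abstract, the following is a minimal Python sketch of the classical Good-Toulmin estimator in its standard form, $-\sum_{i \ge 1} (-t)^i \Phi_i$ with $t = m/n$, where $\Phi_i$ is the number of elements seen exactly $i$ times among the $n$ samples. It is not the modified estimator proposed in the paper, which the abstract does not specify. The function names `prevalences`, `good_toulmin`, and `horizon` are hypothetical, and the use of natural logarithms in `horizon` is an assumption, since the abstract does not state the base.

```python
import math
from collections import Counter


def prevalences(samples):
    """Map i -> number of distinct elements that appear exactly i times."""
    element_counts = Counter(samples)
    return Counter(element_counts.values())


def good_toulmin(samples, m):
    """Classical Good-Toulmin estimate of U_n(m), intended for m <= n.

    Computes -sum_i (-t)^i * Phi_i with t = m / n. This is the textbook
    estimator, not the paper's modified version.
    """
    n = len(samples)
    t = m / n
    phi = prevalences(samples)
    return -sum((-t) ** i * count for i, count in phi.items())


def horizon(n, delta):
    """Evaluate m = n log n / log(3 / delta) from the abstract.

    Assumption: natural logarithms; the abstract does not give the base.
    """
    return n * math.log(n) / math.log(3 / delta)


if __name__ == "__main__":
    observed = ["a", "a", "b", "c"]  # toy data: n = 4 samples
    print(good_toulmin(observed, m=4))
    print(horizon(n=100, delta=0.3))
```

For $t > 1$ the terms $(-t)^i$ grow geometrically in $i$, which is why the unmodified series is only expected to behave well for $m \le n$.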

