Estimating the number of unseen species: How far can one foresee?

Population estimation is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher, uses $n$ samples to estimate $U$, the number of hitherto unseen elements that will be observed among $t \cdot n$ new samples. A clear benchmark question is how large $t$ can be while $U$ can still be estimated well. In seminal works, Good and Toulmin constructed an intriguing estimator that approximates $U$ for all $t \le 1$, and Efron and Thisted showed empirically that a variation of this estimator approximates $U$ even for some $t > 1$; however, no theoretical guarantees are known. We show that a simple modification of the estimator can accurately predict $U$, with a provable bound on the normalized mean-squared error, for $t$ up to order $\log n$. The bound applies to any $n$ and $t$ without any hidden constants, and the algorithm is a simple linear estimator, making it particularly suitable for applications. We also show that no estimator can approximate $U$ for $t$ beyond order $\log n$.
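For concreteness, the classical Good and Toulmin estimator mentioned above can be sketched in a few lines. This is a minimal illustration of the unmodified estimator only, not the modified estimator proposed in the paper. It assumes the standard form $\hat{U} = -\sum_{i \ge 1} (-t)^i \Phi_i$, where $\Phi_i$ is the number of elements seen exactly $i$ times among the $n$ samples; the function name `good_toulmin` and the toy data are hypothetical.

```python
from collections import Counter


def good_toulmin(samples, t):
    """Classical Good-Toulmin estimate of the number of new elements
    expected among t * n further samples, where n = len(samples)."""
    # How many times each distinct element appears in the sample.
    counts = Counter(samples)
    # prevalences[i] = number of elements seen exactly i times (Phi_i).
    prevalences = Counter(counts.values())
    # U_hat = -sum_{i >= 1} (-t)^i * Phi_i, which is linear in the Phi_i.
    return -sum((-t) ** i * phi for i, phi in prevalences.items())


# Toy sample (hypothetical): Phi_1 = 2, Phi_2 = 1, Phi_3 = 1.
samples = ["a", "a", "b", "c", "c", "c", "d"]
estimate_same_size = good_toulmin(samples, 1)    # predict for n more samples
estimate_half_size = good_toulmin(samples, 0.5)  # predict for n/2 more samples
```

The weights $(-t)^i$ grow geometrically in magnitude once $t > 1$, which is one way to see why the unmodified estimator becomes unstable in that regime and why some modification is needed there.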