Types, Tokens, and Hapaxes: A New Heap's Law

Abstract

Heap's Law states that in a large enough text corpus, the number of types as a function of tokens grows as $N = KM^\beta$ for some free parameters $K, \beta$. Much has been written about how this result and various generalizations can be derived from Zipf's Law. Here we derive from first principles a completely novel expression of the type-token curve and prove its superior accuracy on real text. This expression naturally generalizes to equally accurate estimates for counting hapaxes and higher $n$-legomena.
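
To make the quantities in the abstract concrete, the following minimal Python sketch counts types and hapaxes as a token stream is read, then fits the classical power law $N = KM^\beta$ by ordinary least squares in log-log space. It illustrates only the classical law stated above, not the new expression derived in the paper. The function names `type_token_curve` and `fit_heaps` are hypothetical, and whitespace tokenization is an assumption made for illustration.

```python
import math
from collections import Counter


def type_token_curve(tokens):
    """Return a list of (M, N, H) after each token is read.

    M = tokens read so far, N = distinct types so far,
    H = hapaxes so far (types seen exactly once).
    """
    counts = Counter()
    hapaxes = 0
    curve = []
    for m, tok in enumerate(tokens, start=1):
        counts[tok] += 1
        if counts[tok] == 1:
            hapaxes += 1  # a new type starts out as a hapax
        elif counts[tok] == 2:
            hapaxes -= 1  # a second occurrence ends hapax status
        curve.append((m, len(counts), hapaxes))
    return curve


def fit_heaps(curve):
    """Fit log N = log K + beta * log M by least squares.

    Assumes the curve has at least two distinct values of M.
    Returns the pair (K, beta).
    """
    xs = [math.log(m) for m, n, _ in curve]
    ys = [math.log(n) for m, n, _ in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


if __name__ == "__main__":
    # Toy input; a real experiment would use a large corpus.
    tokens = "the cat sat on the mat and the dog sat too".split()
    curve = type_token_curve(tokens)
    K, beta = fit_heaps(curve)
    print(curve[-1], K, beta)
```

On a real corpus one would typically subsample the curve at logarithmically spaced values of $M$ before fitting, since an unweighted fit over every token position gives most of its weight to the largest $M$.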
