On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts

IEEE Transactions on Information Theory, 2008
Abstract

The article presents a new interpretation of Zipf's law in natural language that draws on two areas of information theory. We reformulate the problem of grammar-based compression and investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: If an $n$-letter long text describes $n^\beta$ independent facts in a random but consistent way, then the text contains at least $n^\beta/\log n$ different words. In the formal statement, two specific postulates are adopted. Firstly, the words are understood as the nonterminal symbols of the shortest grammar-based encoding of the text. Secondly, the texts are assumed to be emitted by a nonergodic source, with the described facts being binary IID variables that are asymptotically predictable in a shift-invariant way. The linguistic relevance of the presented modeling assumptions, theorems, definitions, and examples is discussed in parallel.
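
To make the first postulate concrete, here is a minimal sketch, assuming a greedy Re-Pair-style heuristic as a stand-in for the shortest grammar-based encoding (finding the truly smallest grammar is computationally hard, so this is only an approximation, not the paper's construction). The number of introduced nonterminals plays the role of the text's vocabulary, and the toy input and the value of $\beta$ are hypothetical choices for illustration only.

```python
# Illustrative sketch (assumption: a greedy Re-Pair-style heuristic, not the
# paper's method or the true smallest grammar). We repeatedly replace the most
# frequent pair of symbols with a fresh nonterminal and count how many
# nonterminals are introduced, then compare against the n^beta / log n bound
# from the abstract for a hypothetical beta.
from collections import Counter
from math import log


def repair_vocabulary(text: str) -> int:
    """Greedily replace the most frequent adjacent pair with a new
    nonterminal until no pair occurs twice; return the nonterminal count."""
    seq = list(text)
    nonterminals = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        nonterminals += 1
        new_symbol = ("N", nonterminals)  # fresh nonterminal symbol
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return nonterminals


if __name__ == "__main__":
    text = "abracadabra " * 50          # toy text, chosen arbitrarily
    n = len(text)
    beta = 0.5                           # hypothetical exponent
    print("nonterminals introduced:", repair_vocabulary(text))
    print("lower bound n^beta / log n:", n ** beta / log(n))
```

For a highly repetitive toy text like the one above, the heuristic introduces few nonterminals, whereas the proposition concerns texts that describe many independent facts; the contrast is only meant to make the quantities in the informal statement tangible.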
