Estimating Entropy of Distributions in Constant Space

Abstract

We consider the task of estimating the entropy of $k$-ary distributions from samples in the streaming model, where space is limited. Our main contribution is an algorithm that requires $O\left(\frac{k \log^2(1/\varepsilon)}{\varepsilon^3}\right)$ samples and only a constant $O(1)$ number of words of memory, and outputs a $\pm\varepsilon$ estimate of $H(p)$. Without space limitations, the sample complexity has been established as $S(k,\varepsilon)=\Theta\left(\frac{k}{\varepsilon\log k}+\frac{\log^2 k}{\varepsilon^2}\right)$, which is sub-linear in the domain size $k$, but the current algorithms that achieve this optimal sample complexity also require nearly linear space in $k$. Our algorithm partitions $[0,1]$ into intervals and estimates the entropy contribution of the probability values in each interval. The intervals are designed to trade off the bias and variance of these estimates.
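To make the sampling idea concrete, the Python sketch below illustrates a constant-space streaming estimator built on the identity $H(p)=\mathbb{E}_{x\sim p}[\log(1/p_x)]$: it repeatedly draws a symbol from the stream, estimates that symbol's probability from a short window of subsequent samples, and averages $\log(1/\hat p_x)$. This is not the paper's algorithm; the names and parameters (naive_constant_space_entropy, repetitions, window) are illustrative choices, and the plug-in estimate carries exactly the kind of bias that the paper's interval-based corrections are designed to control.

import math
import random

def stream(p, rng):
    """Infinite i.i.d. sample stream from a discrete distribution p (dict: symbol -> prob).
    The dictionary is only the simulated data source; the estimator itself never stores it."""
    symbols = list(p.keys())
    weights = list(p.values())
    while True:
        yield rng.choices(symbols, weights=weights, k=1)[0]

def naive_constant_space_entropy(sample_stream, repetitions=2000, window=200):
    """Illustrative constant-space estimator of H(p) in nats.

    For each repetition: read one symbol x from the stream (so x ~ p), estimate
    p_x by its frequency among the next `window` symbols, and accumulate
    log(1/p_hat). Only O(1) counters are held at any time. The window length
    controls a bias/variance trade-off; choosing it per probability interval,
    with suitable corrections, is the refinement the paper develops.
    """
    total = 0.0
    for _ in range(repetitions):
        x = next(sample_stream)                            # sampled symbol, distributed as p
        hits = sum(next(sample_stream) == x for _ in range(window))
        p_hat = max(hits, 1) / window                      # crude estimate of p_x; clamp to avoid log(0)
        total += math.log(1.0 / p_hat)                     # plug-in contribution log(1/p_x), up to bias
    return total / repetitions

if __name__ == "__main__":
    rng = random.Random(0)
    p = {i: 1.0 / 8 for i in range(8)}                     # uniform over 8 symbols: H(p) = log 8 ~ 2.079 nats
    s = stream(p, rng)
    print(naive_constant_space_entropy(s, repetitions=3000, window=300))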
