
Prediction with a Short Memory

Abstract

We consider the problem of predicting the next observation given a sequence of past observations. We show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by $I$, then a simple Markov model over the most recent $I/\epsilon$ observations can obtain KL error $\epsilon$ with respect to the optimal predictor with access to the entire past. For a Hidden Markov Model with $n$ states, $I$ is bounded by $\log n$, a quantity that does not depend on the mixing time. We also demonstrate that the simple Markov model cannot really be improved upon: First, a window length of $I/\epsilon$ ($I/\epsilon^2$) is information-theoretically necessary to achieve KL error $\epsilon$ ($\ell_1$ error $\epsilon$). Second, the $d^{\Theta(I/\epsilon)}$ samples required to accurately estimate the Markov model when observations are drawn from an alphabet of size $d$ are in fact necessary for any computationally tractable algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
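The predictor in the upper bound is simply an order-$L$ Markov model, where the window length $L$ plays the role of $I/\epsilon$. The sketch below (our illustration, not code from the paper) shows one way such a model could be estimated: count next-symbol occurrences after each length-$L$ context and predict with add-one smoothing. The function names, smoothing choice, and example data are assumptions for the sake of illustration.

```python
# A minimal sketch of an order-L Markov predictor that conditions only on
# the most recent L observations, estimated from empirical counts.
# (Illustrative only; add-one smoothing is our choice, not the paper's.)
from collections import Counter, defaultdict

def fit_markov(seq, L):
    """Count next-symbol occurrences after each length-L context."""
    counts = defaultdict(Counter)
    for t in range(L, len(seq)):
        context = tuple(seq[t - L:t])
        counts[context][seq[t]] += 1
    return counts

def predict(counts, context, alphabet_size):
    """Smoothed conditional distribution over the next symbol given the
    last L observations; unseen contexts default to the uniform distribution."""
    c = counts[tuple(context)]
    total = sum(c.values()) + alphabet_size
    return [(c[a] + 1) / total for a in range(alphabet_size)]

# Usage: fit on a binary sequence, then predict from the last L=2 symbols.
seq = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
counts = fit_markov(seq, L=2)
print(predict(counts, seq[-2:], alphabet_size=2))  # context (0, 1) -> [5/6, 1/6]
```

Estimating this model naively requires counts for up to $d^L$ contexts over an alphabet of size $d$, which is where the $d^{\Theta(I/\epsilon)}$ sample requirement in the lower bound comes from.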
