
Prediction with a Short Memory

Abstract

We consider the problem of predicting the next observation given a sequence of past observations. We show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by $I$, then a simple Markov model over the most recent $I/\epsilon$ observations obtains expected KL error $\epsilon$ (and hence $\ell_1$ error $\sqrt{\epsilon}$) with respect to the optimal predictor that has access to the entire past. For a Hidden Markov Model with $n$ states, $I$ is bounded by $\log n$, a quantity that does not depend on the mixing time. We also establish that this result cannot be improved upon, in the following senses. First, a window length of $I/\epsilon$ is information-theoretically necessary for expected KL error $\epsilon$, or $\ell_1$ error $\sqrt{\epsilon}$. Second, the $d^{\Theta(I/\epsilon)}$ samples required to accurately estimate the Markov model when observations are drawn from an alphabet of size $d$ are necessary for any computationally tractable learning/prediction algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
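To make the windowed predictor concrete, here is a minimal sketch (not from the paper) of an order-$L$ empirical Markov model over an alphabet of size $d$, where $L$ stands in for the $I/\epsilon$ window of the theorem. The class name `WindowMarkovPredictor`, the add-one (Laplace) smoothing, and the streaming `update`/`predict` interface are illustrative assumptions, not the authors' construction.

```python
from collections import defaultdict

class WindowMarkovPredictor:
    """Order-L Markov predictor: estimates P(next | last L observations)
    from empirical counts, with add-one (Laplace) smoothing."""

    def __init__(self, window_len, alphabet_size):
        self.L = window_len        # window length, playing the role of I/epsilon
        self.d = alphabet_size
        # Maps a length-<=L context tuple to counts of the following symbol.
        self.counts = defaultdict(lambda: [0] * alphabet_size)

    def update(self, history, next_obs):
        # Condition on only the most recent L observations; everything
        # older is deliberately forgotten.
        context = tuple(history[-self.L:])
        self.counts[context][next_obs] += 1

    def predict(self, history):
        # Smoothed conditional distribution over the next observation.
        context = tuple(history[-self.L:])
        c = self.counts[context]
        total = sum(c) + self.d
        return [(ci + 1) / total for ci in c]

# Example: learn from a binary alternating sequence and predict the next symbol.
seq = [0, 1, 0, 1, 0, 1, 0, 1]
model = WindowMarkovPredictor(window_len=2, alphabet_size=2)
for t in range(2, len(seq)):
    model.update(seq[:t], seq[t])
print(model.predict(seq))  # puts high probability on 0, the alternating continuation
```

The lower bound quoted in the abstract concerns exactly this kind of estimator: the table of contexts can have up to $d^L$ entries, which is why $d^{\Theta(I/\epsilon)}$ samples are needed to populate it accurately.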
