
Prediction with a Short Memory

Abstract

We consider the problem of predicting the next observation given a sequence of past observations. We show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by $I$, then a simple Markov model over the most recent $I/\epsilon$ observations obtains expected KL error $\epsilon$ (and hence $\ell_1$ error $\sqrt{\epsilon}$) with respect to the optimal predictor that has access to the entire past. For a Hidden Markov Model with $n$ states, $I$ is bounded by $\log n$, a quantity that does not depend on the mixing time. We also establish that this result cannot be improved upon, in the following senses. First, a window length of $I/\epsilon$ is information-theoretically necessary for expected KL error $\epsilon$, or $\ell_1$ error $\sqrt{\epsilon}$. Second, the $d^{\Theta(I/\epsilon)}$ samples required to accurately estimate the Markov model when observations are drawn from an alphabet of size $d$ are necessary for any computationally tractable learning/prediction algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
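To make the windowed predictor concrete, here is a minimal sketch (not from the paper) of an order-$L$ empirical Markov model over an alphabet of size $d$, where $L$ stands in for the $I/\epsilon$ window of the theorem. The class name `WindowMarkovPredictor`, the add-one (Laplace) smoothing, and the streaming `update`/`predict` interface are illustrative assumptions, not the authors' construction.

```python
from collections import defaultdict

class WindowMarkovPredictor:
    """Order-L Markov predictor: estimates P(next | last L observations)
    from empirical counts, with add-one (Laplace) smoothing."""

    def __init__(self, window_len, alphabet_size):
        self.L = window_len        # window length, playing the role of I/epsilon
        self.d = alphabet_size
        # Maps a length-<=L context tuple to counts of the following symbol.
        self.counts = defaultdict(lambda: [0] * alphabet_size)

    def update(self, history, next_obs):
        # Condition on only the most recent L observations; everything
        # older is deliberately forgotten.
        context = tuple(history[-self.L:])
        self.counts[context][next_obs] += 1

    def predict(self, history):
        # Smoothed conditional distribution over the next observation.
        context = tuple(history[-self.L:])
        c = self.counts[context]
        total = sum(c) + self.d
        return [(ci + 1) / total for ci in c]

# Example: learn from a binary alternating sequence and predict the next symbol.
seq = [0, 1, 0, 1, 0, 1, 0, 1]
model = WindowMarkovPredictor(window_len=2, alphabet_size=2)
for t in range(2, len(seq)):
    model.update(seq[:t], seq[t])
print(model.predict(seq))  # puts high probability on 0, the alternating continuation
```

The lower bound quoted in the abstract concerns exactly this kind of estimator: the table of contexts can have up to $d^L$ entries, which is why $d^{\Theta(I/\epsilon)}$ samples are needed to populate it accurately.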
