
Prediction with a Short Memory

Abstract

We consider the problem of predicting the next observation given a sequence of past observations. We show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by $I$, then a simple Markov model over the most recent $I/\epsilon$ observations can obtain KL error $\epsilon$ with respect to the optimal predictor with access to the entire past. For a Hidden Markov Model with $n$ states, $I$ is bounded by $\log n$, a quantity that does not depend on the mixing time. We also demonstrate that the simple Markov model cannot really be improved upon: First, a window length of $I/\epsilon$ ($I/\epsilon^2$) is information-theoretically necessary to achieve KL error $\epsilon$ ($\ell_1$ error $\epsilon$). Second, the $d^{\Theta(I/\epsilon)}$ samples required to accurately estimate the Markov model when observations are drawn from an alphabet of size $d$ are in fact necessary for any computationally tractable algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
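The predictor in the upper bound is simply an order-$L$ Markov model, where the window length $L$ plays the role of $I/\epsilon$. The sketch below (our illustration, not code from the paper) shows one way such a model could be estimated: count next-symbol occurrences after each length-$L$ context and predict with add-one smoothing. The function names, smoothing choice, and example data are assumptions for the sake of illustration.

```python
# A minimal sketch of an order-L Markov predictor that conditions only on
# the most recent L observations, estimated from empirical counts.
# (Illustrative only; add-one smoothing is our choice, not the paper's.)
from collections import Counter, defaultdict

def fit_markov(seq, L):
    """Count next-symbol occurrences after each length-L context."""
    counts = defaultdict(Counter)
    for t in range(L, len(seq)):
        context = tuple(seq[t - L:t])
        counts[context][seq[t]] += 1
    return counts

def predict(counts, context, alphabet_size):
    """Smoothed conditional distribution over the next symbol given the
    last L observations; unseen contexts default to the uniform distribution."""
    c = counts[tuple(context)]
    total = sum(c.values()) + alphabet_size
    return [(c[a] + 1) / total for a in range(alphabet_size)]

# Usage: fit on a binary sequence, then predict from the last L=2 symbols.
seq = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
counts = fit_markov(seq, L=2)
print(predict(counts, seq[-2:], alphabet_size=2))  # context (0, 1) -> [5/6, 1/6]
```

Estimating this model naively requires counts for up to $d^L$ contexts over an alphabet of size $d$, which is where the $d^{\Theta(I/\epsilon)}$ sample requirement in the lower bound comes from.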
