Prediction with a Short Memory
- AI4TS

We consider the problem of predicting the next observation given a sequence of past observations. We show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by I, then a simple Markov model over the most recent I/ε observations obtains expected KL error ε (and hence ℓ1 error √ε) with respect to the optimal predictor with access to the entire past. For a Hidden Markov Model with n hidden states, I is bounded by log n, a quantity that does not depend on the mixing time. We also demonstrate that the simple Markov model cannot be improved upon: First, a window length of I/ε (log n/ε for HMMs) is information-theoretically necessary for expected KL error ε (ℓ1 error √ε). Second, the d^{Θ(log n/ε)} samples required to accurately estimate the Markov model when observations are drawn from an alphabet of size d are in fact necessary for any computationally tractable algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
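The predictor in the abstract is an order-ℓ Markov model fit from empirical window frequencies: predict the next symbol from counts of how often each symbol followed the most recent length-ℓ window. A minimal sketch of that idea (the function names, the add-α smoothing, and the toy sequence are illustrative assumptions, not from the paper):

```python
from collections import Counter, defaultdict

def fit_window_model(seq, window):
    """Empirical frequencies of (length-`window` context -> next symbol)."""
    counts = defaultdict(Counter)
    for i in range(window, len(seq)):
        ctx = tuple(seq[i - window:i])
        counts[ctx][seq[i]] += 1
    return counts

def predict(counts, ctx, alphabet, alpha=1.0):
    """Add-`alpha` smoothed next-symbol distribution given the recent window."""
    c = counts.get(tuple(ctx), Counter())
    total = sum(c.values()) + alpha * len(alphabet)
    return {a: (c[a] + alpha) / total for a in alphabet}

# Toy alternating sequence: after context (0,), symbol 1 should dominate.
seq = [0, 1, 0, 1, 0, 1, 0, 1]
model = fit_window_model(seq, window=1)
dist = predict(model, [0], alphabet=[0, 1])
```

The theoretical result says a window of length I/ε suffices for this kind of predictor; the hardness result concerns the number of samples needed to estimate these counts accurately over a large alphabet.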
View on arXiv