
On the Use of Non-Stationary Policies for Infinite-Horizon Discounted Markov Decision Processes

Abstract

We consider infinite-horizon discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. We consider the algorithm Value Iteration and the sequence of policies $\pi_1, \dots, \pi_k$ it generates until some iteration $k$. We provide performance bounds for non-stationary policies involving the last $m$ generated policies that reduce the state-of-the-art bound for the last stationary policy $\pi_k$ by a factor $\frac{1-\gamma}{1-\gamma^m}$. In other words, and contrary to a common intuition, we show that it may be much easier to find a non-stationary approximately-optimal policy than a stationary one.
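For concreteness, below is a minimal sketch, not the paper's code, of the construction the abstract describes: run approximate Value Iteration on a small random MDP, keep the last $m$ greedy policies, and compare the stationary policy $\pi_k$ against the periodic non-stationary policy that loops through $\pi_k, \pi_{k-1}, \dots, \pi_{k-m+1}$. If we recall the standard bounds correctly, approximate Value Iteration with per-iteration error $\epsilon$ satisfies $\limsup_k \|v_* - v_{\pi_k}\|_\infty \le \frac{2\gamma\epsilon}{(1-\gamma)^2}$, and the claimed factor improves this to $\frac{2\gamma\epsilon}{(1-\gamma)(1-\gamma^m)}$ for the non-stationary policy. All sizes and constants in the sketch (`n_states`, `n_actions`, `k`, `m`, `eps`) are illustrative assumptions, not the paper's experimental setup.

```python
# A minimal sketch, not the paper's code: a random MDP, approximate Value
# Iteration with additive evaluation noise, and a comparison of the last
# stationary policy pi_k with the periodic non-stationary policy that loops
# through the last m greedy policies. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 20, 4, 0.95
k, m, eps = 30, 5, 0.05

# Random MDP: P[a, s, s'] is the transition kernel, R[s, a] the reward.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def q_values(v):
    # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * v[s']
    return R + gamma * np.einsum("asp,p->sa", P, v)

def eval_stationary(pi):
    # Exact value of a stationary policy: v = (I - gamma * P_pi)^{-1} r_pi.
    P_pi = P[pi, np.arange(n_states), :]
    r_pi = R[np.arange(n_states), pi]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def eval_periodic(pis):
    # Value of the non-stationary policy playing pis[0], pis[1], ... cyclically:
    # fixed point of the composed Bellman operator T_{pis[0]} ... T_{pis[-1]},
    # found by iterating the composition (a gamma^m-contraction).
    v = np.zeros(n_states)
    for _ in range(2000):
        v_new = v
        for pi in reversed(pis):  # apply T_{pis[-1]} first, T_{pis[0]} last
            v_new = (R[np.arange(n_states), pi]
                     + gamma * P[pi, np.arange(n_states), :] @ v_new)
        if np.max(np.abs(v_new - v)) < 1e-12:
            return v_new
        v = v_new
    return v

# Approximate Value Iteration: record the greedy policy at each step and
# perturb the value update to simulate the approximation error epsilon.
v, policies = np.zeros(n_states), []
for _ in range(k):
    q = q_values(v)
    policies.append(q.argmax(axis=1))
    v = q.max(axis=1) + eps * rng.standard_normal(n_states)

# Reference optimal value via exact Value Iteration run to (near) convergence.
v_star = np.zeros(n_states)
for _ in range(2000):
    v_star = q_values(v_star).max(axis=1)

v_stat = eval_stationary(policies[-1])      # stationary pi_k
v_ns = eval_periodic(policies[-m:][::-1])   # loop pi_k, pi_{k-1}, ..., pi_{k-m+1}
print("||v* - v_{pi_k}||_inf   =", np.max(np.abs(v_star - v_stat)))
print("||v* - v_{pi_k,m}||_inf =", np.max(np.abs(v_star - v_ns)))
```

On a given random MDP the gap between the two printed errors varies, but the non-stationary policy's error should tend to be the smaller of the two, in line with the $\frac{1-\gamma}{1-\gamma^m}$ factor in the abstract.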
