On the Use of Non-Stationary Policies for Infinite-Horizon Discounted
Markov Decision Processes
Abstract
We consider infinite-horizon γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. We consider the algorithm Value Iteration and the sequence of policies π1, …, πk it generates until some iteration k. We provide performance bounds for non-stationary policies involving the last m generated policies that reduce the state-of-the-art bound for the last stationary policy πk by a factor (1 − γ^m)/(1 − γ). In other words, and contrary to a common intuition, we show that it may be much easier to find a non-stationary approximately-optimal policy than a stationary one.
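The sketch below is a minimal, illustrative toy experiment, not code from the paper: it runs approximate Value Iteration on a randomly generated MDP, injecting a small error of magnitude eps at each iteration (a stand-in for the per-iteration error in the bounds), and then compares the last stationary greedy policy (the m = 1 case) against non-stationary policies that cycle through the last m generated policies. The MDP, the uniform noise model, and all names are assumptions made for the demonstration; exact numbers will vary with the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, eps = 30, 4, 0.98, 0.05

# Random toy MDP (an assumption for the demo):
# P[a, s, s'] transition kernel, R[s, a] expected reward.
P = rng.random((n_actions, n_states, n_states)) ** 4
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))
S = np.arange(n_states)

def q_backup(v):
    """Q-values Q(s,a) = r(s,a) + gamma * E[v(s')] for a value function v."""
    return R + gamma * np.einsum("ast,t->sa", P, v)

def policy_backup(v, pi):
    """One application of the Bellman operator T_pi."""
    return R[S, pi] + gamma * (P[pi, S] @ v)

# Reference optimal value v*, via (near-)exact Value Iteration.
v_star = np.zeros(n_states)
for _ in range(5000):
    v_star = q_backup(v_star).max(axis=1)

# Approximate Value Iteration: each iterate is corrupted by noise of
# size ~eps; keep the greedy policy pi_i from every iteration.
k = 300
v = np.zeros(n_states)
policies = []
for _ in range(k):
    q = q_backup(v)
    policies.append(q.argmax(axis=1))
    v = q.max(axis=1) + eps * rng.uniform(-1, 1, n_states)

def value_of_cycle(loop, sweeps=1000):
    """Value of the non-stationary policy that plays loop[-1] first,
    then loop[-2], ..., loop[0], and repeats. Computed by iterating the
    cycle's Bellman operators; the last backup applied in each sweep
    corresponds to the first policy played."""
    v = np.zeros(n_states)
    for _ in range(sweeps):
        for pi in loop:  # chronological order = reversed play order
            v = policy_backup(v, pi)
    return v

for m in (1, 2, 5, 10):
    v_m = value_of_cycle(policies[-m:])
    print(f"m={m:2d}  ||v* - v_pi(k,m)||_inf = {np.max(v_star - v_m):.4f}")
```

With m = 1 this recovers the usual guarantee for the last stationary policy; as m grows, the result above predicts the gap shrinks roughly by the factor (1 − γ^m)/(1 − γ), approaching a 1/(1 − γ) improvement for large m.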
