
Convergence of Fast Policy Iteration in Markov Games and Robust MDPs

8 August 2025

Keith Badger

Marek Petrik

Jefferson Huang


Main: 7 pages

Figures: 8

Bibliography: 1 page

Tables: 1

Appendix: 6 pages

Abstract

Markov games and robust MDPs are closely related models that involve computing a pair of saddle point policies. As part of the long-standing effort to develop efficient algorithms for these models, the Filar-Tolwinski (FT) algorithm has shown considerable promise. As our first contribution, we demonstrate that FT may fail to converge to a saddle point and may loop indefinitely, even in small games. This observation contradicts the proof of FT's convergence to a saddle point in the original paper. As our second contribution, we propose Residual Conditioned Policy Iteration (RCPI). RCPI builds on FT, but is guaranteed to converge to a saddle point. Our numerical results show that RCPI outperforms other convergent algorithms by several orders of magnitude.
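
For readers unfamiliar with the term, the saddle point mentioned in the abstract can be stated in the standard form for a discounted zero-sum Markov game. The notation below is an assumption made for illustration and is not taken from the paper: states $s \in S$, maximizer actions $a \in A$, minimizer actions $b \in B$, reward $r(s,a,b)$, transition probabilities $p(s' \mid s,a,b)$, discount factor $\gamma \in [0,1)$, and $v^{\pi,\sigma}$ the value of the policy pair $(\pi,\sigma)$. It describes only the problem being solved, not FT or RCPI.

```latex
% Saddle-point condition: neither player gains by deviating unilaterally.
\[
  v^{\pi,\sigma^\star}(s) \;\le\; v^{\pi^\star,\sigma^\star}(s) \;\le\; v^{\pi^\star,\sigma}(s)
  \qquad \text{for all } \pi,\ \sigma,\ s \in S .
\]

% Shapley operator: each state requires solving a matrix game over mixed actions.
\[
  (Tv)(s) \;=\; \max_{x \in \Delta(A)} \; \min_{y \in \Delta(B)} \;
  \sum_{a \in A} \sum_{b \in B} x_a \, y_b
  \Bigl[ r(s,a,b) + \gamma \sum_{s' \in S} p(s' \mid s,a,b)\, v(s') \Bigr].
\]

% The saddle-point value is the unique fixed point of T.
\[
  v^\star = T v^\star , \qquad v^\star = v^{\pi^\star,\sigma^\star}.
\]
```

Because $T$ is a $\gamma$-contraction in the sup norm, repeatedly applying it converges to $v^\star$. Policy-iteration-style methods aim to reach the same fixed point in fewer iterations.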
