
Revisit Policy Optimization in Matrix Form

Abstract

In the tabular case, when the reward and environment dynamics are known, policy evaluation can be written as $\bm{V}_{\bm{\pi}} = (I - \gamma P_{\bm{\pi}})^{-1} \bm{r}_{\bm{\pi}}$, where $P_{\bm{\pi}}$ is the state transition matrix induced by policy $\bm{\pi}$ and $\bm{r}_{\bm{\pi}}$ is the reward vector under $\bm{\pi}$. The inconvenience is that $P_{\bm{\pi}}$ and $\bm{r}_{\bm{\pi}}$ are both entangled with $\bm{\pi}$: every time we update $\bm{\pi}$, both change with it. In this paper, we leverage the notation from \cite{wang2007dual} to disentangle $\bm{\pi}$ from the environment dynamics, which makes optimization over the policy more straightforward. We show that the policy gradient theorem \cite{sutton2018reinforcement} and TRPO \cite{schulman2015trust} can be placed in a more general framework, and that this notation has good potential to extend to model-based reinforcement learning.
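
To make the matrix-form policy evaluation above concrete, here is a minimal numpy sketch (illustrative only, not the paper's code). It assumes hypothetical arrays P of shape (S, A, S) for transition probabilities, r of shape (S, A) for expected rewards, and pi of shape (S, A) for the policy, and a helper name policy_evaluation chosen for this example; it forms P_pi and r_pi by mixing over actions and then solves the linear system instead of explicitly inverting (I - gamma * P_pi).

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma=0.9):
    # P:  (S, A, S) transition probabilities, rows sum to 1 over next states
    # r:  (S, A)    expected immediate rewards
    # pi: (S, A)    policy, rows sum to 1 over actions
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, r)     # expected per-state reward under pi
    S = P.shape[0]
    # Solve (I - gamma * P_pi) V = r_pi rather than forming the inverse explicitly.
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# Usage on a tiny random MDP (hypothetical data for illustration).
rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))
pi = np.full((S, A), 1.0 / A)               # uniform policy
print(policy_evaluation(P, r, pi))
```

Note how P_pi and r_pi must both be recomputed whenever pi changes, which is exactly the entanglement the abstract points out.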
