Revisit Policy Optimization in Matrix Form
Abstract
In the tabular case, when the reward and environment dynamics are known, policy evaluation can be written as $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$, where $P^\pi$ is the state transition matrix given policy $\pi$ and $R^\pi$ is the reward signal given $\pi$. The difficulty is that $P^\pi$ and $R^\pi$ are both entangled with $\pi$: every time we update $\pi$, they change together. In this paper, we leverage the notation of \cite{wang2007dual} to disentangle $\pi$ from the environment dynamics, which makes optimization over the policy more straightforward. We show that the policy gradient theorem \cite{sutton2018reinforcement} and TRPO \cite{schulman2015trust} can be put into a more general framework, and that this notation has good potential to be extended to model-based reinforcement learning.
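For concreteness, here is a minimal Python sketch of the matrix-form evaluation $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$ described above; the paper itself does not provide code, and all names and shapes below are illustrative assumptions. It also makes the entanglement visible: both $P^\pi$ and $R^\pi$ are built by averaging the dynamics and rewards under $\pi$, so both change whenever $\pi$ does.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9):
    """Exact tabular policy evaluation: V^pi = (I - gamma * P^pi)^{-1} R^pi.

    P  : (S, A, S) array, P[s, a, s'] = Pr(s' | s, a)  -- environment dynamics
    R  : (S, A) array, expected reward for taking action a in state s
    pi : (S, A) array, pi[s, a] = probability of action a in state s
    (These shapes are assumptions for illustration, not the paper's notation.)
    """
    S = P.shape[0]
    # P^pi and R^pi both mix the policy with the dynamics/rewards:
    P_pi = np.einsum('sa,sat->st', pi, P)   # (S, S) state transition matrix under pi
    R_pi = np.einsum('sa,sa->s', pi, R)     # (S,) expected one-step reward under pi
    # Solve the Bellman equation (I - gamma * P^pi) V = R^pi exactly
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# Tiny random MDP to exercise the solver
rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)
print(evaluate_policy(P, R, pi))
```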
