
Revisiting Prioritized Experience Replay: A Value Perspective

Abstract

Experience replay enables off-policy reinforcement learning (RL) agents to utilize past experiences to maximize the cumulative reward. Prioritized experience replay, which weighs experiences by the magnitude of their temporal-difference error ($|\text{TD}|$), significantly improves learning efficiency. But how $|\text{TD}|$ relates to the importance of an experience is not well understood. We address this problem from an economic perspective, by linking $|\text{TD}|$ to the value of experience, defined as the value added to the cumulative reward by accessing the experience. We theoretically show that the value metrics of experience are upper-bounded by $|\text{TD}|$ for Q-learning. Furthermore, we extend our theoretical framework to maximum-entropy RL by deriving lower and upper bounds on these value metrics for soft Q-learning, which turn out to be the product of $|\text{TD}|$ and the "on-policyness" of the experiences. Our framework links two important quantities in RL: $|\text{TD}|$ and the value of experience. We empirically show that the bounds hold in practice, and that experience replay using the upper bound as priority improves maximum-entropy RL in Atari games.
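The priority schemes described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the use of a simple policy-probability ratio as the "on-policyness" factor are assumptions for exposition; for Q-learning the priority is $|\text{TD}|$, and for soft Q-learning the derived bounds take the form of $|\text{TD}|$ scaled by an on-policyness term.

```python
import numpy as np

def priorities_q(td_errors):
    # Q-learning: priority is |TD|, the upper bound on the value of experience.
    return np.abs(td_errors)

def priorities_soft_q(td_errors, pi_probs, mu_probs):
    # Soft Q-learning (illustrative): |TD| scaled by an "on-policyness" factor,
    # sketched here as the ratio of the current policy's probability to the
    # behavior policy's probability for the stored action (an assumption).
    return np.abs(td_errors) * (pi_probs / mu_probs)

def sample_indices(priorities, batch_size, rng):
    # Sample replay indices with probability proportional to priority.
    p = priorities / priorities.sum()
    return rng.choice(len(priorities), size=batch_size, p=p)
```

High-$|\text{TD}|$ (or high-bound) experiences are thus replayed more often, concentrating updates where the estimated value of the experience is largest.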
