META-Learning State-based λ for More Sample-Efficient Policy Evaluation

Abstract

To improve the sample efficiency of temporal-difference learning, we propose a meta-learning method that adjusts the eligibility-trace parameter, λ, in a state-dependent manner. Our approach applies to both on-policy and off-policy learning. The adaptation of λ is driven by auxiliary learners that learn distributional information about the update targets online, incurring the same per-step cost as the value learner. We prove, under some assumptions, that the proposed method improves the overall quality of the update targets by minimizing the overall target error. The method can also serve as a plugin: it assists prediction with function approximation by meta-learning feature (observation)-based λs online, and it can assist policy improvement in control settings. In our experiments, we observe significant performance gains in these scenarios, as well as improved robustness of the proposed algorithm to learning-rate variations.
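To make the state-dependent-λ idea concrete, the following is a minimal sketch of tabular TD(λ) on a random-walk chain where λ is a per-state array rather than a single scalar, following the standard state-based trace-decay rule e ← γ λ(Sₜ) e. This is only an illustration of the mechanism the abstract builds on; it does not implement the paper's meta-learning of λ or its auxiliary learners, and the environment, function name, and parameters are our own assumptions.

```python
import numpy as np

def td_lambda_state_based(episodes, n_states, lam, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(lambda) with accumulating traces and a state-dependent lambda.

    lam: array of per-state lambda values, indexed like the value table.
    Environment (illustrative): a random-walk chain with states 1..n_states,
    terminal states 0 and n_states+1, and reward 1 on reaching the right end.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states + 2)  # includes both terminal states
    for _ in range(episodes):
        e = np.zeros_like(V)          # eligibility traces
        s = (n_states + 1) // 2       # start in the middle of the chain
        while 0 < s < n_states + 1:
            s_next = s + rng.choice([-1, 1])
            r = 1.0 if s_next == n_states + 1 else 0.0
            delta = r + gamma * V[s_next] - V[s]   # TD error
            e *= gamma * lam[s]       # decay all traces by lambda of the current state
            e[s] += 1.0               # accumulating trace for the visited state
            V += alpha * delta * e
            s = s_next
    return V
```

With a constant λ array this reduces to ordinary TD(λ); the point of the state-based formulation is that each visit decays the traces by λ(Sₜ), so a meta-learner can assign different bootstrapping behavior to different states.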
