Learning the value function of a given policy from data samples is an important problem in Reinforcement Learning. TD(λ) is a popular class of algorithms to solve this problem. However, the weights assigned to different n-step returns in TD(λ), controlled by the parameter λ, decrease exponentially with increasing n. In this paper, we present a λ-schedule procedure that generalizes the TD(λ) algorithm to the case where the parameter λ can vary with the time-step. This allows flexibility in weight assignment, i.e., the user can specify the weights assigned to different n-step returns by choosing a sequence {λ_t}. Based on this procedure, we propose an on-policy algorithm, TD(λ)-schedule, and two off-policy algorithms, GTD(λ)-schedule and TDC(λ)-schedule. We provide proofs of almost sure convergence for all three algorithms under a general Markov noise framework.
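To make the idea concrete, below is a minimal sketch of a TD-style update with linear function approximation in which the trace parameter varies with the time-step, i.e., a λ-schedule. The environment interface (env_step), the feature map (phi), and the accumulating-trace form of the update are illustrative assumptions, not the paper's exact algorithms.

```python
import numpy as np

def td_lambda_schedule(env_step, phi, lambdas, gamma=0.99, alpha=0.05,
                       num_steps=10_000, dim=8, s0=0):
    """Sketch: on-policy TD update with a time-varying trace parameter lambda_t.

    env_step(s) -> (s_next, reward)   # assumed environment interface
    phi(s)      -> feature vector     # linear function approximation
    lambdas     -> sequence giving lambda_t for each time-step t (the schedule)
    """
    theta = np.zeros(dim)   # value-function weights
    e = np.zeros(dim)       # eligibility trace
    s = s0                  # assumed initial state
    for t in range(num_steps):
        s_next, r = env_step(s)
        # TD error for the current transition
        delta = r + gamma * theta @ phi(s_next) - theta @ phi(s)
        # Accumulate the trace with the time-dependent lambda_t; letting
        # lambda_t vary is what permits non-exponential weighting of n-step returns.
        e = gamma * lambdas[t] * e + phi(s)
        theta += alpha * delta * e
        s = s_next
    return theta
```

With a constant sequence (lambdas[t] = λ for all t), this reduces to standard TD(λ); choosing a non-constant schedule redistributes the weights placed on the different n-step returns.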