Learning the value function of a given policy from data samples is an important problem in Reinforcement Learning. TD($\lambda$) is a popular class of algorithms for solving this problem. However, the weights assigned to different $n$-step returns in TD($\lambda$), controlled by the parameter $\lambda$, decrease exponentially with increasing $n$. In this paper, we present a $\lambda$-schedule procedure that generalizes the TD($\lambda$) algorithm to the case in which the parameter $\lambda$ can vary with the time step. This allows flexibility in weight assignment, i.e., the user can specify the weights assigned to different $n$-step returns by choosing a sequence $\{\lambda_t\}_{t \geq 1}$. Based on this procedure, we propose an on-policy algorithm, TD($\lambda$)-schedule, and two off-policy algorithms, GTD($\lambda$)-schedule and TDC($\lambda$)-schedule. We provide proofs of almost sure convergence for all three algorithms under a general Markov noise framework.
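To make the weight-assignment idea concrete, the following math sketch contrasts the standard $\lambda$-return, whose weights on $n$-step returns decay geometrically, with one plausible form of a $\lambda$-schedule return in which the sequence $\{\lambda_t\}$ controls those weights. This is an illustrative assumption based on the usual time-dependent $\lambda$-return; the paper's exact definition may differ.

```latex
% Standard lambda-return: the weight on the n-step return G_{t:t+n}
% is (1 - \lambda) \lambda^{n-1}, which decays exponentially in n.
\begin{equation}
  G_t^{\lambda} = (1 - \lambda) \sum_{n \ge 1} \lambda^{n-1} \, G_{t:t+n}
\end{equation}

% Illustrative lambda-schedule return (an assumed form, not necessarily the
% paper's exact definition): replacing the constant \lambda by the sequence
% \{\lambda_t\} lets the user shape the weight placed on each n-step return.
\begin{equation}
  G_t^{\{\lambda_t\}}
    = \sum_{n \ge 1}
      \Bigl( \prod_{i=1}^{n-1} \lambda_{t+i} \Bigr)
      (1 - \lambda_{t+n}) \, G_{t:t+n}
\end{equation}
```

Under this form the weights still sum to one (the partial sums telescope to $1 - \prod_{i=1}^{n} \lambda_{t+i}$), and setting $\lambda_t \equiv \lambda$ recovers the standard TD($\lambda$) weighting.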