Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootstrapping, i.e., they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF)--a policy-dependent model--and linearly combining them with instantaneous rewards. We focus on the bootstrapping targets used when estimating value functions, and propose a new backup target, the $\eta$-return mixture, which implicitly combines value-predictive knowledge (used by TD methods) with (successor) feature-predictive knowledge, with a parameter $\eta$ capturing how much to rely on each. We illustrate that incorporating predictive knowledge through an $\eta\gamma$-discounted SF model makes more efficient use of sampled experience, compared to either extreme: bootstrapping entirely on the value function estimate, or bootstrapping on the product of separately estimated successor features and instantaneous reward models. We empirically show that this approach leads to faster policy evaluation and better control performance in both tabular and nonlinear function approximation settings, indicating scalability and generality.
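To make the interpolation concrete, the sketch below shows one possible tabular policy-evaluation instantiation of an $\eta$-mixed backup target, consistent with the description above but not necessarily the authors' exact algorithm. It assumes one-hot state features (so the reward weights $w$ reduce to an estimated per-state reward vector), a hypothetical environment callback `env_step(s) -> (s_next, r, done)`, and an auxiliary $\eta\gamma$-discounted predictor of future values; in expectation, the target recovers the TD(0) target at $\eta = 0$ and the product of the estimated successor features and reward model at $\eta = 1$.

```python
import numpy as np

def eta_mixture_policy_evaluation(env_step, n_states, gamma=0.95, eta=0.5,
                                  alpha_v=0.1, alpha_model=0.1,
                                  n_transitions=50_000, seed=0):
    """Tabular policy evaluation with an eta-mixed backup target (illustrative sketch).

    Assumes one-hot features phi(s) = e_s, so the (eta*gamma)-discounted SF row
    psi[s] holds expected discounted state visitations and the learned reward
    vector r_hat plays the role of the weights w.  g[s] is an auxiliary
    (eta*gamma)-discounted prediction of future values V(s_{t+1}).  The value
    backup target is

        y(s) = psi[s] @ r_hat + (1 - eta) * gamma * g[s],

    which in expectation reduces to the TD(0) target r + gamma * V(s') when
    eta = 0, and to the SF-times-reward estimate w^T psi(s) when eta = 1.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)            # value estimates
    psi = np.eye(n_states)            # (eta*gamma)-discounted successor features
    r_hat = np.zeros(n_states)        # expected-immediate-reward model (weights w)
    g = np.zeros(n_states)            # discounted prediction of future V(s_{t+1})

    s = int(rng.integers(n_states))
    for _ in range(n_transitions):
        s_next, r, done = env_step(s)           # one transition under the evaluated policy
        cont = 0.0 if done else eta * gamma     # continuation factor for the eta*gamma discount

        # Reward model: running average of the immediate reward observed in s.
        r_hat[s] += alpha_model * (r - r_hat[s])

        # Successor features learned by TD with discount eta*gamma.
        phi = np.zeros(n_states)
        phi[s] = 1.0
        psi[s] += alpha_model * (phi + cont * psi[s_next] - psi[s])

        # Auxiliary predictor of (eta*gamma)-discounted future values.
        v_next = 0.0 if done else V[s_next]
        g[s] += alpha_model * (v_next + cont * g[s_next] - g[s])

        # eta-mixed backup target for the value function.
        y = psi[s] @ r_hat + (1.0 - eta) * gamma * g[s]
        V[s] += alpha_v * (y - V[s])

        s = int(rng.integers(n_states)) if done else s_next
    return V
```

The intermediate values of $\eta$ in this sketch trade off how much of the target comes from the (successor) feature model versus from bootstrapping on the current value estimates, mirroring the trade-off described in the abstract.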