Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, particularly in the experience replay setting now commonly used with deep neural networks. Classically, off-policy estimation bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio (via eligibility traces) after each action. Many important off-policy algorithms such as Tree Backup and Retrace rely on this mechanism along with differing protocols for truncating ("cutting") the ratios ("traces") to counteract the excessive variance of the IS estimator. Unfortunately, cutting traces on a per-decision basis is not necessarily efficient; once a trace has been cut according to local information, the effect cannot be reversed later, potentially resulting in the premature truncation of estimated returns and slower learning. In the interest of motivating efficient off-policy algorithms, we propose a multistep operator that permits arbitrary past-dependent traces. We prove that our operator is convergent for policy evaluation, and for optimal control when targeting greedy-in-the-limit policies. Our theorems establish the first convergence guarantees for many existing algorithms including Truncated IS, Non-Markov Retrace, and history-dependent TD($\lambda$). Our theoretical results also provide guidance for the development of new algorithms that jointly consider multiple past decisions for better credit assignment and faster learning.
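The contrast between per-decision and past-dependent trace cutting can be made concrete with a small sketch. It is illustrative only: the function names, the example ratios, and the specific forms used for Retrace (clip each ratio at 1, then multiply) and Truncated IS (clip the running IS product at 1) are assumptions chosen for the example, not definitions taken from the text above.

```python
from itertools import accumulate
from operator import mul


def retrace_traces(ratios, lam=1.0):
    """Per-decision cutting (assumed Retrace-style rule).

    Each IS ratio is clipped at 1 on its own, and the clipped factors are
    multiplied together, so an early cut shrinks every later trace.
    """
    cuts = [lam * min(1.0, rho) for rho in ratios]
    return list(accumulate(cuts, mul))


def truncated_is_traces(ratios, lam=1.0):
    """Past-dependent cutting (assumed Truncated-IS-style rule).

    The full IS product up to each step is clipped at 1, so the trace
    depends on the whole history and can recover after a small ratio.
    """
    products = accumulate(ratios, mul)
    return [lam ** (t + 1) * min(1.0, p) for t, p in enumerate(products)]


def corrected_return(q0, td_errors, traces, gamma):
    """Compute q0 + delta_0 + sum over t >= 1 of gamma^t * trace_t * delta_t."""
    weights = [1.0] + list(traces)  # the first TD error is never re-weighted
    return q0 + sum(
        gamma ** t * w * d for t, (w, d) in enumerate(zip(weights, td_errors))
    )


# Hypothetical ratios pi(a|s) / mu(a|s) for steps 1..3.
ratios = [0.5, 2.0, 2.0]
print(retrace_traces(ratios))       # [0.5, 0.5, 0.5]
print(truncated_is_traces(ratios))  # [0.5, 1.0, 1.0]
```

With these ratios, the per-decision rule stays at 0.5 after the first cut, while the history-dependent rule returns to 1.0 once later ratios compensate. This is the kind of irreversibility the abstract attributes to cutting traces from local information alone.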