Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across $\lambda$-values in an off-policy control task.
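For concreteness, one common backward-view instantiation of the per-decision mechanism described above is the following tabular action-value update (a sketch in standard notation, which may differ from the notation adopted later in the paper):
\begin{align*}
\rho_t &= \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}, \\
z_t(s,a) &= \gamma \beta_t\, z_{t-1}(s,a) + \mathbb{1}\{S_t = s,\, A_t = a\}, \\
\delta_t &= R_{t+1} + \gamma \sum_{a'} \pi(a' \mid S_{t+1})\, Q(S_{t+1}, a') - Q(S_t, A_t), \\
Q(s,a) &\leftarrow Q(s,a) + \alpha\, \delta_t\, z_t(s,a),
\end{align*}
where $\pi$ and $\mu$ are the target and behavior policies and the trace-decay factor $\beta_t$ encodes the cutting protocol: for example, $\beta_t = \lambda \rho_t$ for full per-decision importance sampling, or $\beta_t = \lambda \min(1, \rho_t)$ for Retrace-style cutting. Because $z_t$ depends multiplicatively on past decay factors, a single $\beta_t = 0$ zeroes the traces of all previously visited pairs permanently, which is the irreversibility that motivates the trajectory-aware methods studied in this paper.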