Efficient credit assignment is essential for reinforcement learning algorithms in both prediction and control settings. We describe a unified view of temporal-difference algorithms for selective credit assignment. These selective algorithms apply weightings to quantify the contribution of learning updates. We present insights into applying weightings to value-based learning and planning algorithms, and describe their role in mediating the backward distribution of credit in prediction and control. Within this space, we identify some existing online learning algorithms that can assign credit selectively as special cases, and introduce new algorithms that assign credit backward in time counterfactually, allowing credit to be assigned off-trajectory and off-policy.
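As a rough illustration of what weighting a learning update can mean, the sketch below runs tabular TD(λ) with accumulating eligibility traces, where a per-state weight in [0, 1] scales how much each visited state enters the trace and therefore how much backward credit it later receives. This is a generic textbook-style construction under stated assumptions, not the specific algorithms of this work; the names `selective_td_lambda` and `weight_fn`, the step sizes, and the three-state chain are all hypothetical.

```python
import numpy as np

def selective_td_lambda(episodes, weight_fn, n_states,
                        alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda) in which weight_fn(s) scales state s's trace entry.

    Hypothetical helper for illustration only. Each episode is a list of
    (state, reward, next_state, done) transitions.
    """
    v = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)  # eligibility trace, reset at episode start
        for s, r, s_next, done in episode:
            # One-step TD error; terminal transitions do not bootstrap.
            target = r if done else r + gamma * v[s_next]
            delta = target - v[s]
            # Decay all traces, then add the weighted entry for the current state.
            e *= gamma * lam
            e[s] += weight_fn(s)
            # States with zero weight never accumulate trace, so they get no credit.
            v += alpha * delta * e
    return v

# Assumed toy chain 0 -> 1 -> 2 (terminal), with reward 1 on the final step.
episode = [(0, 0.0, 1, False), (1, 1.0, 2, True)]
v_selective = selective_td_lambda([episode], lambda s: 1.0 if s == 1 else 0.0, 3)
v_uniform = selective_td_lambda([episode], lambda s: 1.0, 3)
```

With the selective weighting, only state 1 is updated by the final reward, whereas the uniform weighting also passes discounted credit back to state 0 through the trace.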