Off-policy prediction -- learning the value function for one policy from data generated while following another policy -- is one of the most challenging subproblems in reinforcement learning. This paper presents empirical results with eleven prominent off-policy learning algorithms that use linear function approximation: five Gradient-TD methods, two Emphatic-TD methods, Off-policy TD($\lambda$), Vtrace, and versions of Tree Backup and ABQ modified to apply to a prediction setting. Our experiments used the Collision task, a small idealized off-policy problem analogous to that of an autonomous car trying to predict whether it will collide with an obstacle. We assessed the performance of the algorithms according to their learning rate, asymptotic error level, and sensitivity to step-size and bootstrapping parameters. By these measures, the eleven algorithms can be partially ordered on the Collision task. In the top tier, the two Emphatic-TD algorithms learned the fastest, reached the lowest errors, and were robust to parameter settings. In the middle tier, the five Gradient-TD algorithms and Off-policy TD($\lambda$) were more sensitive to the bootstrapping parameter. The bottom tier comprised Vtrace, Tree Backup, and ABQ; these algorithms were no faster and had higher asymptotic error than the others. Our results are definitive for this task, though of course experiments with more tasks are needed before an overall assessment of the algorithms' merits can be made.
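To make the prediction setting concrete, here is a minimal sketch of Off-policy TD($\lambda$), the simplest of the eleven algorithms compared, with linear function approximation and per-decision importance sampling. The two-state MDP, one-hot features, rewards, and policies below are illustrative assumptions for exposition only; they are not the Collision task.

```python
# Sketch of Off-policy TD(lambda) prediction with linear function
# approximation. MDP, features, and policies are assumed toy examples.
import random

random.seed(0)

alpha, gamma, lam = 0.1, 0.9, 0.8

# Assumed one-hot features for states 0 and 1.
x = {0: [1.0, 0.0], 1: [0.0, 1.0]}
# Target policy pi and behaviour policy b over actions {0, 1}.
pi = {0: 1.0, 1: 0.0}   # target: always take action 0
b = {0: 0.5, 1: 0.5}    # behaviour: uniform random

w = [0.0, 0.0]   # weight vector (value estimate is w . x[s])
e = [0.0, 0.0]   # eligibility trace

s = 0
for t in range(1000):
    a = random.choice([0, 1])        # action drawn from behaviour policy
    rho = pi[a] / b[a]               # importance-sampling ratio
    s_next = 1 - s if a == 0 else s  # assumed toy transition
    r = 1.0 if s_next == 1 else 0.0  # assumed reward

    v = sum(wi * xi for wi, xi in zip(w, x[s]))
    v_next = sum(wi * xi for wi, xi in zip(w, x[s_next]))
    delta = r + gamma * v_next - v   # TD error

    # Accumulating trace, scaled by rho each step.
    e = [rho * (gamma * lam * ei + xi) for ei, xi in zip(e, x[s])]
    w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]
    s = s_next

print([round(wi, 3) for wi in w])
```

Because the behaviour policy takes actions the target policy would not, the ratio `rho` reweights (or zeroes) the trace so that, in expectation, the weights learn the target policy's value function from off-policy data.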