We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.