Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency, as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates. The price of this inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited and the very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation, which can be solved in closed form. The solution is differentiable with respect to the policy parameters and provides an estimate of the policy gradient. In this way, we avoid the high variance of importance-sampling approaches and the high bias of semi-gradient methods. We empirically analyze the quality of our gradient estimate against state-of-the-art methods and show that it outperforms the baselines in terms of sample efficiency on classical control tasks.
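As a minimal illustrative sketch of this idea (not the paper's exact formulation), suppose the reward and transition model are represented nonparametrically by normalized kernel weights over the $n$ collected samples, so that the value function is estimated only at the sampled states. With discount factor $\gamma$, the resulting Bellman equation is linear and admits a closed-form solution:
\[
\mathbf{v}_\theta \;=\; \mathbf{r} + \gamma P_\theta \mathbf{v}_\theta
\quad\Longrightarrow\quad
\mathbf{v}_\theta \;=\; (I - \gamma P_\theta)^{-1}\,\mathbf{r},
\]
where $\mathbf{r} \in \mathbb{R}^n$ collects the observed rewards and $P_\theta \in \mathbb{R}^{n \times n}$ is a stochastic matrix whose entries depend smoothly on the policy parameters $\theta$ (e.g., through $\pi_\theta(a \mid s)$ evaluated at the sampled successor states). Because the solution is obtained from a differentiable linear solve, $\nabla_\theta \mathbf{v}_\theta$ can be computed directly, without importance-sampling ratios or semi-gradient approximations.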