Most prior approaches to offline reinforcement learning (RL) have taken an iterative actor-critic approach involving off-policy evaluation. In this paper we show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark. The simple one-step baseline achieves this strong performance without many of the tricks used by previously proposed iterative algorithms and is more robust to hyperparameters. We argue that the relatively poor performance of iterative approaches is a result of the high variance inherent in off-policy evaluation, which is then magnified by the repeated optimization of policies against those high-variance estimates. In addition, we hypothesize that the strong performance of the one-step algorithm is due to a combination of favorable structure in the environment and the behavior policy.
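To make the recipe concrete, the following is a minimal sketch of what "one step of constrained/regularized policy improvement on top of an on-policy Q estimate of the behavior policy" can look like. The notation is introduced here for illustration only: $\mathcal{D}$ is the offline dataset, $\hat\beta$ the estimated behavior policy, $\hat Q^{\beta}$ its fitted Q-function, and $D$ with weight $\alpha$ stands in for whatever divergence constraint or regularizer is chosen; the abstract leaves that choice open.

\begin{align*}
  \hat\beta &\in \arg\max_{\pi}\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log \pi(a \mid s)\big]
    && \text{(behavior cloning)} \\
  \hat Q^{\beta} &\in \arg\min_{Q}\; \mathbb{E}_{(s,a,r,s',a')\sim\mathcal{D}}\Big[\big(Q(s,a) - r - \gamma\, Q(s',a')\big)^{2}\Big]
    && \text{(on-policy evaluation of } \hat\beta\text{)} \\
  \hat\pi &\in \arg\max_{\pi}\; \mathbb{E}_{s\sim\mathcal{D}}\Big[\mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[\hat Q^{\beta}(s,a)\big] \;-\; \alpha\, D\big(\pi(\cdot\mid s)\,\|\,\hat\beta(\cdot\mid s)\big)\Big]
    && \text{(one regularized improvement step)}
\end{align*}

In contrast to iterative actor-critic methods, the third step is performed once against the fixed estimate $\hat Q^{\beta}$, rather than alternating repeatedly between policy evaluation and improvement.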