Offline model selection (OMS), that is, choosing the best policy from a set of many policies given only logged data, is crucial for applying offline RL in real-world settings. One idea that has been extensively explored is to select policies based on the mean squared Bellman error (MSBE) of the associated Q-functions. However, previous work has struggled to obtain adequate OMS performance with Bellman errors, leading many researchers to abandon the idea. Through theoretical and empirical analyses, we elucidate why previous work has seen pessimistic results with Bellman errors and identify conditions under which OMS algorithms based on Bellman errors will perform well. Moreover, we develop a new estimator of the MSBE that is more accurate than prior methods and obtains impressive OMS performance on diverse discrete control tasks, including Atari games. We open-source our data and code to enable researchers to conduct OMS experiments more easily.
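For orientation, below is a minimal sketch of the simplest Bellman-error-based OMS baseline: ranking candidate Q-functions by their naive empirical squared TD error on logged transitions. This is not the paper's improved MSBE estimator; the function names, array shapes, and the random toy data are assumptions for illustration only.

```python
# Hypothetical sketch: rank candidate Q-functions by naive empirical squared
# TD error, the simplest Bellman-error-based OMS baseline. This is NOT the
# paper's improved MSBE estimator; names and shapes are assumptions.
import numpy as np

def naive_squared_td_error(q_values, q_next_values, rewards, dones, gamma=0.99):
    """Mean squared TD error over a batch of logged transitions.

    q_values:      (N,) Q(s_t, a_t) for the logged action of each transition.
    q_next_values: (N, A) Q(s_{t+1}, .) over all discrete actions.
    rewards:       (N,) logged rewards.
    dones:         (N,) 1.0 if s_{t+1} is terminal, else 0.0.
    """
    targets = rewards + gamma * (1.0 - dones) * q_next_values.max(axis=1)
    return float(np.mean((q_values - targets) ** 2))

def rank_policies(candidates, batch, gamma=0.99):
    """Return candidate indices sorted from lowest to highest TD error."""
    scores = [
        naive_squared_td_error(c["q"], c["q_next"], batch["r"], batch["done"], gamma)
        for c in candidates
    ]
    return np.argsort(scores)

if __name__ == "__main__":
    # Toy data standing in for precomputed Q-values on a logged dataset.
    rng = np.random.default_rng(0)
    n, n_actions = 1024, 4
    batch = {"r": rng.normal(size=n), "done": rng.integers(0, 2, size=n).astype(float)}
    candidates = [
        {"q": rng.normal(size=n), "q_next": rng.normal(size=(n, n_actions))}
        for _ in range(3)
    ]
    print(rank_policies(candidates, batch))
```

One well-known limitation of this naive estimate is that in stochastic environments it conflates the true Bellman error with noise in the sampled transitions, which is why more careful MSBE estimators, such as the one proposed here, are needed.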