We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy-gradients, unlike previous offline RL algorithms. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods.
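To make the "stationary distribution correction" idea concrete, the following is a minimal sketch of a DICE-style dual objective, assuming a chi-square divergence regularizer and the Lagrangian/closed-form-correction structure common to DICE-family methods; it omits the paper's policy extraction and normalization details, and all network, batch, and hyperparameter names (nu, alpha, dual_loss, etc.) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

# Sketch: estimate stationary distribution corrections w(s, a) = d*(s, a) / d^D(s, a)
# by optimizing a state-value-like dual variable nu, assuming a chi-square
# divergence regularizer with strength alpha. Names and shapes are illustrative.

gamma, alpha = 0.99, 1.0
obs_dim = 17

# nu(s): scalar dual variable over states.
nu = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(nu.parameters(), lr=3e-4)

def dual_loss(s0, s, r, s_next):
    """Dual objective over nu, evaluated on one batch.

    s0:           initial states sampled from the initial-state distribution
    s, r, s_next: transitions sampled from the offline dataset d^D
    """
    # Advantage-like residual e_nu(s, a) = r + gamma * nu(s') - nu(s).
    e = r + gamma * nu(s_next).squeeze(-1) - nu(s).squeeze(-1)
    # Closed-form nonnegative correction for the chi-square divergence:
    # w = max(0, e / alpha + 1).
    w = torch.relu(e / alpha + 1.0)
    # Lagrangian: (1 - gamma) E_{p0}[nu(s0)] + E_{d^D}[w * e - alpha * f(w)],
    # with f(w) = 0.5 * (w - 1)^2 for the chi-square divergence.
    return (1 - gamma) * nu(s0).mean() + (w * e - alpha * 0.5 * (w - 1) ** 2).mean()

# Example gradient step on a placeholder random batch.
batch = 256
s0, s, s_next = (torch.randn(batch, obs_dim) for _ in range(3))
r = torch.randn(batch)
loss = dual_loss(s0, s, r, s_next)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the corrections w are obtained in closed form from the learned nu and the objective is evaluated entirely on dataset samples, no bootstrapped action-value target for out-of-distribution actions is ever queried, which is how this family of methods sidesteps the overestimation issue described above.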