Offline reinforcement learning (RL) provides a promising direction to exploit the massive amount of offline data for complex decision-making tasks. Due to the distribution shift issue, current offline RL algorithms are generally designed to be conservative in value estimation and action selection. However, such conservatism impairs the robustness of learned policies, making them highly sensitive to even small perturbations of observations. To trade off robustness and conservatism, we propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique. In RORL, we explicitly introduce regularization on the policy and the value function for states near the dataset, as well as additional conservative value estimation on these perturbed out-of-distribution (OOD) states. Theoretically, we show that RORL enjoys a tighter suboptimality bound than recent theoretical results in linear MDPs. We demonstrate that RORL achieves state-of-the-art performance on the general offline RL benchmark and is considerably robust to adversarial observation perturbations.
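To make the conservative smoothing idea concrete, the sketch below illustrates one way such a regularizer could look: states perturbed within a small ball around dataset states are treated as OOD, the Q-function and policy are encouraged to be smooth on them, and their Q-values receive an extra conservative penalty. This is a minimal illustration under assumed PyTorch-style networks, not the paper's exact objective; `q_net`, `policy_net`, `eps`, and the loss weights are hypothetical placeholders.

```python
# Minimal sketch of a conservative-smoothing style regularizer (illustrative,
# not the exact RORL objective): sample perturbed states in an L-inf ball
# around dataset states, regularize the Q-function and policy to be smooth
# there, and add a conservative penalty on Q-values at these OOD states.
import torch
import torch.nn.functional as F


def conservative_smoothing_loss(q_net, policy_net, states, actions,
                                eps=0.01, n_samples=10,
                                smooth_coef=1.0, ood_coef=1.0):
    B, s_dim = states.shape
    # Sample perturbed (OOD) states uniformly in an L-inf ball of radius eps.
    noise = (torch.rand(n_samples, B, s_dim, device=states.device) * 2 - 1) * eps
    perturbed = (states.unsqueeze(0) + noise).reshape(-1, s_dim)
    rep_actions = actions.unsqueeze(0).expand(n_samples, -1, -1).reshape(-1, actions.shape[-1])

    q_clean = q_net(states, actions)       # Q-values on dataset states, shape (B, 1)
    q_ood = q_net(perturbed, rep_actions)  # Q-values on perturbed states, shape (n_samples * B, 1)

    # Value smoothing: keep Q on perturbed states close to Q on the original states.
    value_smooth = F.mse_loss(q_ood, q_clean.repeat(n_samples, 1).detach())

    # Policy smoothing: keep actions on perturbed states close to those on the original states.
    pi_clean = policy_net(states)
    pi_ood = policy_net(perturbed)
    policy_smooth = F.mse_loss(pi_ood, pi_clean.repeat(n_samples, 1).detach())

    # Conservative term: push down Q-values on the perturbed OOD states.
    ood_penalty = q_ood.mean()

    return smooth_coef * (value_smooth + policy_smooth) + ood_coef * ood_penalty
```

This loss would be added to a standard actor-critic offline RL objective; the balance between the smoothing terms and the OOD penalty is what trades off robustness against conservatism.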