Offline reinforcement learning (RL) provides a promising direction for exploiting massive amounts of offline data in complex decision-making tasks. Due to the distribution shift issue, current offline RL algorithms are generally designed to be conservative in value estimation and action selection. However, such conservatism can impair the robustness of learned policies when they encounter observation deviations under realistic conditions, such as sensor errors and adversarial attacks. To trade off robustness and conservatism, we propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique. In RORL, we explicitly introduce regularization on the policy and the value function for states near the dataset, as well as additional conservative value estimation on these out-of-distribution (OOD) states. Theoretically, we show that RORL enjoys a tighter suboptimality bound than recent theoretical results in linear MDPs. We demonstrate that RORL achieves state-of-the-art performance on general offline RL benchmarks and is considerably robust to adversarial observation perturbations.
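To make the two ingredients described above concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: uniform noise within an epsilon-ball stands in for the perturbations around dataset states, and a simple mean Q-value term stands in for the additional conservative estimation on OOD states; the names `q_net`, `policy`, `epsilon`, `n_samples`, and `beta` are hypothetical.

```python
import torch
import torch.nn.functional as F


def conservative_smoothing_terms(q_net, policy, states, epsilon=0.005,
                                 n_samples=10, beta=0.1):
    """Illustrative sketch: smoothing + OOD conservatism for a batch of dataset states.

    Assumes q_net(states, actions) returns Q-values and policy(states) returns actions.
    """
    batch, state_dim = states.shape

    with torch.no_grad():
        actions = policy(states)          # actions on clean (in-dataset) states
        q_clean = q_net(states, actions)  # reference values the smoothed Q should stay near

    # Sample perturbed states uniformly within an L_inf ball of radius epsilon.
    noise = (torch.rand(n_samples, batch, state_dim,
                        device=states.device) * 2 - 1) * epsilon
    perturbed = (states.unsqueeze(0) + noise).reshape(-1, state_dim)

    # Policy smoothing: actions on perturbed states should match actions on clean states.
    policy_smooth = F.mse_loss(policy(perturbed), actions.repeat(n_samples, 1))

    # Value smoothing: Q on perturbed states (with clean actions) should stay close to q_clean.
    q_perturbed = q_net(perturbed, actions.repeat(n_samples, 1)).reshape(n_samples, batch)
    q_smooth = F.mse_loss(q_perturbed,
                          q_clean.reshape(1, batch).expand(n_samples, batch))

    # Additional conservatism: push down the estimated value of the perturbed (OOD) states.
    ood_penalty = q_perturbed.mean()

    return policy_smooth + q_smooth, beta * ood_penalty
```

In practice, the smoothing terms and the OOD penalty returned here would be added to a standard actor-critic offline RL loss; the scale of `epsilon` and `beta` controls the trade-off between robustness to observation perturbations and conservatism.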