Offline reinforcement learning (RL) provides a promising direction for exploiting massive amounts of offline data for complex decision-making tasks. Due to the distribution shift issue, current offline RL algorithms are generally designed to be conservative in value estimation and action selection. However, such conservatism can impair the robustness of learned policies when they encounter observation deviations under realistic conditions, such as sensor errors and adversarial attacks. To trade off robustness and conservatism, we propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique. In RORL, we explicitly introduce regularization on the policy and the value function for states near the dataset, as well as additional conservative value estimation on these states. Theoretically, we show that RORL enjoys a tighter suboptimality bound than recent theoretical results in linear MDPs. We demonstrate that RORL achieves state-of-the-art performance on the general offline RL benchmark and is considerably robust to adversarial observation perturbations.
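To make the conservative smoothing idea concrete, the sketch below illustrates one way the described regularization could be implemented in PyTorch: dataset states are perturbed within a small ball, Q-values and policy outputs at the perturbed states are pulled toward their values at the original states, and an extra penalty discourages overestimation at the perturbed (near-out-of-distribution) states. This is a minimal sketch under assumed interfaces (`q_net`, `policy`, `eps`, `n_samples`, `ood_weight` are all hypothetical names), not the paper's actual implementation; for instance, it samples perturbations uniformly rather than adversarially.

```python
import torch
import torch.nn.functional as F

def conservative_smoothing_losses(q_net, policy, states, actions,
                                  eps=0.01, n_samples=10, ood_weight=0.1):
    """Illustrative sketch (not the official RORL code): smoothing and
    conservative penalties on states perturbed within an L_inf ball."""
    B, S = states.shape
    # Sample perturbed states uniformly in the eps-ball around each dataset state.
    noise = (torch.rand(n_samples, B, S, device=states.device) * 2 - 1) * eps
    perturbed = (states.unsqueeze(0) + noise).reshape(-1, S)

    rep_actions = actions.repeat(n_samples, 1)
    q_clean = q_net(states, actions).repeat(n_samples, 1)
    q_pert = q_net(perturbed, rep_actions)

    # Value smoothing: keep Q at perturbed states close to Q at the clean states.
    value_smooth = F.mse_loss(q_pert, q_clean.detach())

    # Policy smoothing: keep the action output stable under state perturbation
    # (assumes a deterministic-mean policy head returning an action tensor).
    a_clean = policy(states)
    a_pert = policy(perturbed)
    policy_smooth = F.mse_loss(a_pert, a_clean.repeat(n_samples, 1).detach())

    # Conservative term: penalize overestimation at perturbed, near-OOD states.
    conservative = F.relu(q_pert - q_clean.detach()).mean()

    return value_smooth, policy_smooth, ood_weight * conservative
```

These three terms would be added, with suitable weights, to the base actor and critic objectives of an off-policy algorithm; the weighting and the perturbation scheme are design choices not specified by this abstract.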