Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment. Directly applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by out-of-distribution (OOD) actions. Previous methods tackle this problem by penalizing the Q-values of OOD actions or constraining the trained policy to be close to the behavior policy. Nevertheless, such methods typically prevent the value function from generalizing beyond the offline data and also lack a precise characterization of OOD data. In this paper, we propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints. Specifically, PBRL quantifies uncertainty via the disagreement of bootstrapped Q-functions and performs pessimistic updates by penalizing the value function based on the estimated uncertainty. To further address the extrapolation error, we propose a novel OOD sampling method. We show that such OOD sampling, together with pessimistic bootstrapping, yields a provable uncertainty quantifier in linear MDPs, providing the theoretical underpinning for PBRL. Extensive experiments on the D4RL benchmark show that PBRL outperforms state-of-the-art algorithms.
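To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of uncertainty-driven pessimism with bootstrapped Q-ensembles, assuming a standard PyTorch actor-critic setup. The class and function names (`QEnsemble`, `pessimistic_targets`, `ood_penalty`) and the penalty coefficients `beta` and `beta_ood` are illustrative placeholders; the disagreement (standard deviation) across ensemble heads serves as the uncertainty estimate used to penalize both in-distribution targets and actions sampled from the learned policy.

```python
import torch
import torch.nn as nn


class QEnsemble(nn.Module):
    """K independent Q-networks; disagreement across heads is used as an
    uncertainty estimate for (state, action) pairs."""

    def __init__(self, state_dim, action_dim, num_heads=10, hidden=256):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(num_heads)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        # Returns shape (num_heads, batch, 1).
        return torch.stack([head(x) for head in self.heads], dim=0)


def pessimistic_targets(q_ensemble, next_state, next_action, reward,
                        done, gamma=0.99, beta=1.0):
    """Bootstrapped target penalized by the ensemble standard deviation."""
    with torch.no_grad():
        q_next = q_ensemble(next_state, next_action)       # (K, B, 1)
        uncertainty = q_next.std(dim=0)                     # (B, 1)
        value = q_next.mean(dim=0) - beta * uncertainty     # pessimistic value
        return reward + gamma * (1.0 - done) * value        # (B, 1)


def ood_penalty(q_ensemble, state, policy, beta_ood=1.0):
    """Sample actions from the current policy (treated as OOD relative to the
    dataset) and regress their Q-values toward an uncertainty-penalized
    pseudo-target."""
    ood_action = policy(state)
    q_ood = q_ensemble(state, ood_action)                   # (K, B, 1)
    uncertainty = q_ood.std(dim=0)
    target = (q_ood.mean(dim=0) - beta_ood * uncertainty).detach()
    return ((q_ood - target) ** 2).mean()
```

In a training loop, the critic loss would combine the Bellman error against `pessimistic_targets` on dataset transitions with the `ood_penalty` term on policy-sampled actions, so that Q-values are pushed down exactly where the ensemble disagrees most.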