The use of pessimism, when reasoning about datasets lacking exhaustive exploration, has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness, as is standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear MDPs, where stronger function-approximation assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
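As an informal sketch of the idea (with illustrative notation: the version-set tolerance $\varepsilon$, Bellman-error functional $\mathcal{E}$, and initial-state distribution $d_0$ are our placeholder symbols, not necessarily those used in the body of the paper), Bellman-consistent pessimism selects
$$
\hat{\pi} \;\in\; \arg\max_{\pi \in \Pi} \;\; \min_{\{f \in \mathcal{F} \,:\, \mathcal{E}(f, \pi; D) \le \varepsilon\}} \; \mathbb{E}_{s \sim d_0}\big[f(s, \pi(s))\big],
$$
where $\mathcal{E}(f, \pi; D)$ measures the empirical Bellman error of $f$ under $\pi$ on the offline dataset $D$. Pessimism is thus applied only to the value at the initial state, over functions consistent with the Bellman equations, rather than through point-wise lower bounds as in bonus-based constructions.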