Bellman-consistent Pessimism for Offline Reinforcement Learning

The use of pessimism when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees require only Bellman closedness, as is standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation, where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
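The idea of applying pessimism only at the initial state, over the functions consistent with the Bellman equations, can be illustrated with a minimal tabular sketch. This is not the paper's algorithm: the finite candidate set, the plain squared Bellman residual (sensible only for deterministic transitions), the threshold `eps`, and all function and variable names are assumptions made for illustration.

```python
import numpy as np

def bellman_error(f, pi, data, gamma):
    """Mean squared empirical Bellman residual of Q-table f under policy pi.

    Assumption: transitions are deterministic, so the plain squared
    residual is a reasonable consistency measure in this toy setting.
    """
    residuals = [f[s, a] - (r + gamma * f[s2, pi[s2]]) for s, a, r, s2 in data]
    return float(np.mean(np.square(residuals)))

def pessimistic_value(candidates, pi, data, gamma, eps, s0):
    """Smallest initial-state value among nearly Bellman-consistent candidates."""
    version_space = [f for f in candidates
                     if bellman_error(f, pi, data, gamma) <= eps]
    if not version_space:
        return -np.inf  # no candidate explains the data for this policy
    return min(f[s0, pi[s0]] for f in version_space)

def select_policy(candidates, policies, data, gamma, eps, s0):
    """Pick the policy whose pessimistic initial-state value is largest."""
    return max(policies,
               key=lambda pi: pessimistic_value(candidates, pi, data, gamma, eps, s0))

# Hypothetical toy problem: state 0 is initial, state 1 is absorbing with zero reward.
# The dataset only ever takes action 0 in state 0 (reward 0.5); action 1 is unobserved.
data = [(0, 0, 0.5, 1)] * 5 + [(1, 0, 0.0, 1)] * 5

# Candidate Q-tables, indexed as f[state, action].
f_optimistic = np.array([[0.5, 1.0], [0.0, 0.0]])    # fits data, high value for unseen action
f_pessimistic = np.array([[0.5, 0.0], [0.0, 0.0]])   # fits data, low value for unseen action
f_inconsistent = np.array([[2.0, 2.0], [0.0, 0.0]])  # contradicts the observed reward
candidates = [f_optimistic, f_pessimistic, f_inconsistent]

# Deterministic policies as tuples: pi[state] = action.
policies = [(0, 0), (1, 0)]

best = select_policy(candidates, policies, data, gamma=0.9, eps=0.01, s0=0)
print(best)
```

In this toy setup no point-wise lower bound or bonus is ever computed. The unobserved action is penalized only because some function that is consistent with the data assigns it a low value at the initial state.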
