Offline reinforcement learning (RL) learns policies entirely from static datasets, thereby avoiding the challenges associated with online data collection. Practical applications of offline RL will inevitably require learning from datasets where the variability of demonstrated behaviors changes non-uniformly across the state space. For example, at a red light, nearly all human drivers behave similarly by stopping, but when merging onto a highway, some drivers merge quickly, efficiently, and safely, while many hesitate or merge dangerously. Both theoretically and empirically, we show that typical offline RL methods based on distribution constraints fail to learn from data with such non-uniform variability, because they require the learned policy to stay close to the behavior policy to the same extent across the entire state space. Ideally, the learned policy should be free to choose, per state, how closely to follow the behavior policy in order to maximize long-term return, as long as it stays within the support of the behavior policy. To instantiate this principle, we reweight the data distribution in conservative Q-learning (CQL) to obtain an approximate support-constraint formulation. The reweighted distribution is a mixture of the current policy and an additional policy trained to mine poor actions that are likely under the behavior policy. Our method, CQL (ReDS), is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
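As a rough illustration of the reweighting described above (the notation here is ours and not necessarily the paper's exact formulation): let $\pi$ denote the learned policy, $\pi_\beta$ the behavior policy, $Q$ the learned critic, $\mathcal{D}$ the offline dataset, and $\rho_\psi$ the auxiliary "mining" distribution. One way to sketch the mixture distribution used in place of the current policy inside the CQL regularizer, together with a hypothetical KL-regularized objective for $\rho_\psi$, is

$$
\tilde{\rho}(a \mid s) \;=\; \tfrac{1}{2}\,\pi(a \mid s) \;+\; \tfrac{1}{2}\,\rho_\psi(a \mid s),
\qquad
\rho_\psi \;\in\; \arg\max_{\rho}\;
\mathbb{E}_{s \sim \mathcal{D},\, a \sim \rho(\cdot \mid s)}\!\big[-Q(s,a)\big]
\;-\; D_{\mathrm{KL}}\!\big(\rho(\cdot \mid s)\,\big\|\,\pi_\beta(\cdot \mid s)\big),
$$

whose closed-form solution has the Boltzmann-like shape $\rho_\psi(a \mid s) \propto \pi_\beta(a \mid s)\,\exp\!\big(-Q(s,a)\big)$, i.e. it concentrates on actions that are likely under the behavior policy but have low value. Pushing down Q-values under $\tilde{\rho}$ then penalizes poor in-support actions without forcing $\pi$ to match $\pi_\beta$ uniformly across states; the specific weighting and any temperature on $Q$ are illustrative assumptions here.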