Offline reinforcement learning restricts the learning process to rely only on logged data, without access to an environment. While this enables real-world applications, it also poses unique challenges. One important challenge is dealing with errors caused by the overestimation of values for state-action pairs not well covered by the training data. Due to bootstrapping, these errors get amplified during training and can lead to divergence, thereby crippling learning. To overcome this challenge, we introduce Regularized Behavior Value Estimation (R-BVE). Unlike most approaches, which use policy improvement during training, R-BVE estimates the value of the behavior policy during training and only performs policy improvement at deployment time. Further, R-BVE uses a ranking regularisation term that favours actions in the dataset that lead to successful outcomes. We provide ample empirical evidence of R-BVE's effectiveness, including state-of-the-art performance on the RL Unplugged ATARI dataset. We also test R-BVE on new datasets derived from bsuite and a challenging DeepMind Lab task, and show that R-BVE outperforms other state-of-the-art discrete control offline RL methods.
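To make the two ingredients of the abstract concrete, the snippet below is a minimal sketch of a training objective of this kind: a SARSA-style TD loss that evaluates the behavior policy (so no policy improvement happens during training), plus a margin-based ranking regulariser that pushes the value of dataset actions on successful transitions above the other actions, with greedy action selection used only at deployment. The tensor names, the margin, the regularisation weight, and the `success` flag are illustrative assumptions, not the authors' exact implementation.

```python
# A hedged sketch of regularized behavior value estimation, assuming a
# PyTorch Q-network and a batch of logged transitions. All hyperparameters
# and field names are assumptions made for illustration.
import torch
import torch.nn.functional as F


def r_bve_loss(q_net, target_q_net, batch, margin=0.1, reg_weight=1.0, gamma=0.99):
    """Behavior-value TD loss plus a ranking regularisation term.

    `batch` is assumed to hold: obs, action, reward, next_obs,
    next_action (the action the behavior policy actually took next),
    done, and success (1.0 for transitions from successful episodes).
    """
    q_values = q_net(batch["obs"])                                   # [B, A]
    q_taken = q_values.gather(1, batch["action"].unsqueeze(1)).squeeze(1)

    # SARSA-style target: evaluate the behavior policy's next action rather
    # than taking a max, so training performs no policy improvement.
    with torch.no_grad():
        next_q = target_q_net(batch["next_obs"])
        next_q_taken = next_q.gather(1, batch["next_action"].unsqueeze(1)).squeeze(1)
        td_target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q_taken

    td_loss = F.smooth_l1_loss(q_taken, td_target)

    # Ranking regulariser: on successful transitions, require the dataset
    # action's value to exceed every other action's value by `margin`.
    other_actions = torch.ones_like(q_values).scatter_(
        1, batch["action"].unsqueeze(1), 0.0)
    violations = F.relu(q_values - q_taken.unsqueeze(1) + margin) * other_actions
    rank_loss = (batch["success"].unsqueeze(1) * violations).sum(dim=1).mean()

    return td_loss + reg_weight * rank_loss


def act(q_net, obs):
    """Policy improvement only at deployment: act greedily with respect to
    the estimated behavior values."""
    with torch.no_grad():
        return q_net(obs.unsqueeze(0)).argmax(dim=1).item()
```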