Offline reinforcement learning (RL) enables learning control policies by utilizing only prior experience, without any online interaction. This can allow robots to acquire generalizable skills from large and diverse datasets, without any costly or unsafe online data collection. Despite recent algorithmic advances in offline RL, applying these methods to real-world problems has proven challenging. Although offline RL methods can learn from prior data, there is no clear and well-understood process for making various design choices, from model architecture to algorithm hyperparameters, without actually evaluating the learned policies online. In this paper, our aim is to develop a practical workflow for using offline RL analogous to the relatively well-understood workflows for supervised learning problems. To this end, we devise a set of metrics and conditions that can be tracked over the course of offline training, and can inform the practitioner about how the algorithm and model architecture should be adjusted to improve final performance. Our workflow is derived from a conceptual understanding of the behavior of conservative offline RL algorithms and cross-validation in supervised learning. We demonstrate the efficacy of this workflow in producing effective policies without any online tuning, both in several simulated robotic learning scenarios and for three tasks on two distinct real robots, focusing on learning manipulation skills with raw image observations and sparse binary rewards. An explanatory video and additional results can be found at sites.google.com/view/offline-rl-workflow
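To make the idea of tracking metrics over the course of offline training concrete, the snippet below is a minimal sketch, not the paper's released code: it trains a toy conservative (CQL-style) Q-function on a fixed dataset and logs diagnostics analogous to those the workflow monitors, such as average Q-values and TD error on a held-out validation split, mirroring cross-validation in supervised learning. The dataset layout, the network, and hyperparameters like `cql_alpha` are assumptions chosen purely for illustration.

```python
# Illustrative sketch of offline-RL training diagnostics; all names and
# hyperparameters here are assumptions for this example, not the paper's code.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy offline dataset of (state, action, reward, next_state) transitions.
states      = torch.randn(1024, 4)
actions     = torch.randint(0, 2, (1024,))
rewards     = torch.rand(1024)
next_states = torch.randn(1024, 4)

# Hold out a validation split, mirroring cross-validation in supervised learning.
train_idx = torch.arange(0, 896)
val_idx   = torch.arange(896, 1024)

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optim = torch.optim.Adam(q_net.parameters(), lr=3e-4)
gamma = 0.99
cql_alpha = 1.0  # conservatism weight (assumed hyperparameter)

def td_loss(idx):
    """Mean squared Bellman error on the given transition indices."""
    q = q_net(states[idx]).gather(1, actions[idx, None]).squeeze(1)
    with torch.no_grad():
        target = rewards[idx] + gamma * q_net(next_states[idx]).max(1).values
    return ((q - target) ** 2).mean(), q

for step in range(1000):
    bellman, q_data = td_loss(train_idx)
    # CQL-style conservative regularizer: push Q down on all actions,
    # up on dataset actions.
    logsumexp_q = torch.logsumexp(q_net(states[train_idx]), dim=1).mean()
    loss = bellman + cql_alpha * (logsumexp_q - q_data.mean())
    optim.zero_grad()
    loss.backward()
    optim.step()

    if step % 200 == 0:
        with torch.no_grad():
            val_bellman, _ = td_loss(val_idx)
            avg_q = q_net(states).max(1).values.mean()
        # Trends in these quantities (e.g. Q-values diverging, or validation
        # TD error rising while training error falls) are the kind of signal
        # a practitioner would use to adjust conservatism or model capacity,
        # all without any online rollouts.
        print(f"step {step:4d}  train_td {bellman.item():.3f}  "
              f"val_td {val_bellman.item():.3f}  avg_Q {avg_q.item():.3f}")
```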