Model-Based Offline Planning

Offline learning is a key part of making reinforcement learning (RL) usable in real systems. Offline RL addresses scenarios where data from a system's operation is available, but the system itself cannot be accessed while a policy is being learned. Recent work on training RL policies from offline data has shown results both with model-free policies learned directly from the data and with planning on top of models learned from the data. Model-free policies tend to perform better, but they are more opaque, harder to command externally, and harder to integrate into larger systems. We propose an offline learner that generates a model that can be used to control the system directly through planning. This gives us easily controllable policies learned directly from data, without ever interacting with the system. We show the performance of our algorithm, Model-Based Offline Planning (MBOP), on a series of robotics-inspired tasks, and demonstrate its ability to leverage planning to respect environmental constraints. We are able to find near-optimal policies for certain simulated systems from as little as 50 seconds of real-time system interaction, and to create zero-shot goal-conditioned policies on a series of environments. An accompanying video can be found here: https://youtu.be/nxGGHdZOFts
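To make the idea of "planning on top of a model learned from logged data" concrete, here is a minimal sketch in Python. It is not the MBOP algorithm: it assumes a hypothetical 1-D point-mass system, fits a linear dynamics model by least squares, and plans by random shooting. All function names, the cost function, and the parameters are illustrative assumptions. The goal is passed in only at planning time, which loosely illustrates how one learned model can serve several goals.

```python
import numpy as np

def true_step(state, action):
    # Hypothetical system (an assumption for illustration): 1-D point mass,
    # state = [position, velocity], time step 0.1.
    pos, vel = state
    vel = vel + 0.1 * action
    pos = pos + 0.1 * vel
    return np.array([pos, vel])

def collect_offline_data(n_steps, rng):
    # Log transitions under a random behaviour policy. After this call the
    # learner never touches true_step again.
    states, actions, next_states = [], [], []
    state = np.zeros(2)
    for _ in range(n_steps):
        action = rng.uniform(-1.0, 1.0)
        nxt = true_step(state, action)
        states.append(state)
        actions.append(action)
        next_states.append(nxt)
        state = nxt
    return np.array(states), np.array(actions), np.array(next_states)

def fit_linear_model(states, actions, next_states):
    # Least-squares fit of next_state ~ [state, action] @ W, with W of shape (3, 2).
    X = np.column_stack([states, actions])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W

def plan(W, state, goal, rng, horizon=15, n_candidates=512):
    # Random-shooting planner: sample action sequences, roll each one out
    # through the learned model, and return the first action of the cheapest.
    seqs = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    s = np.tile(state, (n_candidates, 1))
    cost = np.zeros(n_candidates)
    for t in range(horizon):
        s = np.column_stack([s, seqs[:, t]]) @ W
        cost += (s[:, 0] - goal) ** 2 + 0.1 * s[:, 1] ** 2
    return seqs[np.argmin(cost), 0]

def run_episode(goal, seed=0, n_steps=100):
    # Learn the model from logged data, then control the system by replanning
    # at every step. The goal only enters through the planner's cost.
    rng = np.random.default_rng(seed)
    W = fit_linear_model(*collect_offline_data(500, rng))
    state = np.zeros(2)
    for _ in range(n_steps):
        state = true_step(state, plan(W, state, goal, rng))
    return state

if __name__ == "__main__":
    print(run_episode(goal=1.0))
    print(run_episode(goal=-1.0))
```

Because the goal appears only in the planning cost, changing it requires no retraining in this toy setting. A constraint could be handled in a similar spirit by adding a penalty term to the cost, though that is a design choice of this sketch rather than something the abstract specifies.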
