The goal of offline reinforcement learning is to learn a policy from a fixed dataset, without further interactions with the environment. This setting is an increasingly important paradigm for real-world applications of reinforcement learning such as robotics, where data collection is slow and potentially dangerous. Existing off-policy algorithms perform poorly on static datasets due to extrapolation errors caused by out-of-distribution actions. This leads to the challenge of constraining the policy to select actions within the support of the dataset during training. We propose to simply learn the Policy in the Latent Action Space (PLAS) such that this requirement is naturally satisfied. We evaluate our method on continuous control benchmarks in simulation and on a deformable object manipulation task with a physical robot. We demonstrate that our method provides competitive performance consistently across various continuous control tasks and different types of datasets, outperforming existing offline reinforcement learning methods with explicit constraints. Videos and code are available at https://sites.google.com/view/latent-policy.
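As a rough illustration of the idea, the sketch below shows one way the support constraint can be satisfied by construction: the policy outputs a bounded latent vector, and a decoder, here assumed to come from a conditional VAE pre-trained on the dataset's state-action pairs, maps that latent vector back to an executable action. The class names (`Decoder`, `LatentPolicy`, `select_action`), network sizes, and latent bound are hypothetical placeholders for exposition, not the paper's implementation.

```python
# Minimal sketch, assuming the latent action space is given by the decoder of a
# conditional VAE trained on the offline dataset; all names and sizes are illustrative.
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Maps (state, latent) -> action; assumed pre-trained as part of a CVAE on the dataset."""
    def __init__(self, state_dim, action_dim, latent_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state, z):
        return self.max_action * self.net(torch.cat([state, z], dim=-1))

class LatentPolicy(nn.Module):
    """Outputs a bounded latent action; decoding it keeps the executed action
    within (or close to) the support of the dataset actions."""
    def __init__(self, state_dim, latent_dim, max_latent=2.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, latent_dim), nn.Tanh(),
        )
        # Restrict z to a high-probability region of the latent prior (assumed bound).
        self.max_latent = max_latent

    def forward(self, state):
        return self.max_latent * self.net(state)

def select_action(policy, decoder, state):
    """Action selection: the policy never emits raw actions directly;
    it proposes a latent vector that the frozen decoder maps to an action."""
    with torch.no_grad():
        z = policy(state)
        return decoder(state, z)
```

In this sketch the out-of-distribution problem is addressed architecturally rather than with an explicit divergence penalty: because every executed action is produced by a decoder fit to the dataset, the policy's search space is restricted to actions the dataset plausibly contains.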