End-to-end learning of robotic manipulation with high data efficiency is one of the key challenges in robotics. Recent methods that utilize human demonstration data and unsupervised representation learning have proven to be a promising direction for improving RL learning efficiency. The use of demonstration data also allows "warming up" RL policies on offline data, either with imitation learning or with recently emerged offline reinforcement learning algorithms. However, existing works often treat offline policy learning and online exploration as two separate processes, which frequently leads to a severe performance drop during the offline-to-online transition. Furthermore, many robotic manipulation tasks involve complex sub-task structures, which are very challenging to solve with sparse-reward RL. In this work, we propose a unified offline-to-online RL framework that resolves the transition performance drop. Additionally, we introduce goal-aware state information to the RL agent, which can greatly reduce task complexity and accelerate policy learning. Combined with an advanced unsupervised representation learning module, our framework achieves superior training efficiency and performance compared with state-of-the-art methods on multiple robotic manipulation tasks.
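To make the two ideas in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of a unified offline-to-online training loop: the agent is first updated on demonstration transitions and then continues online with the same update rule and the same replay buffer, so there is no separate offline-to-online handoff, and the observation is augmented with a one-hot sub-goal indicator as a simple form of goal-aware state information. All names here (agent.update_step, agent.act, env, demo_buffer) are assumed interfaces for illustration only.

```python
import numpy as np

def goal_aware_obs(obs, subgoal_id, num_subgoals):
    """Append a one-hot sub-goal indicator to the raw observation:
    a simple, illustrative form of goal-aware state information."""
    onehot = np.zeros(num_subgoals, dtype=obs.dtype)
    onehot[subgoal_id] = 1.0
    return np.concatenate([obs, onehot])

class ReplayBuffer:
    """Single buffer shared by the offline and online phases."""
    def __init__(self, transitions=None):
        self.transitions = list(transitions or [])
    def add(self, transition):
        self.transitions.append(transition)
    def sample(self, batch_size, rng):
        idx = rng.integers(len(self.transitions), size=batch_size)
        return [self.transitions[i] for i in idx]

def train(agent, env, demo_buffer, offline_steps, online_steps, seed=0):
    rng = np.random.default_rng(seed)
    buffer = ReplayBuffer(demo_buffer)          # seed the buffer with demonstrations
    for _ in range(offline_steps):              # offline "warm-up" phase
        agent.update_step(buffer.sample(256, rng))
    obs = env.reset()
    for _ in range(online_steps):               # online phase: same objective,
        action = agent.act(obs)                 # same buffer, no policy reset
        next_obs, reward, done = env.step(action)
        buffer.add((obs, action, reward, next_obs, done))
        agent.update_step(buffer.sample(256, rng))
        obs = env.reset() if done else next_obs
    return agent
```

The key design choice this sketch illustrates is that the offline and online phases share one optimization objective and one replay buffer, which is one generic way to avoid a discontinuity at the offline-to-online transition; the paper's actual mechanism may differ.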