Learning of Long-Horizon Sparse-Reward Robotic Manipulator Tasks with Base Controllers

Deep Reinforcement Learning (DRL) enables robots to perform some intelligent tasks end-to-end. However, long-horizon sparse-reward robotic manipulator tasks still pose many challenges. On the one hand, a sparse-reward setting makes exploration inefficient. On the other hand, exploration on physical robots is costly and unsafe. In this paper, we propose a method for learning long-horizon sparse-reward tasks that utilizes one or more existing traditional controllers, which we call base controllers. Built upon Deep Deterministic Policy Gradients (DDPG), our algorithm incorporates the existing base controllers into the stages of exploration, value learning, and policy update. Furthermore, we present a straightforward way of synthesizing different base controllers to integrate their strengths. Through experiments ranging from stacking blocks to stacking cups, we demonstrate that the learned state-based or image-based policies steadily outperform the base controllers. Compared to previous work on learning from demonstrations, our method improves sample efficiency by orders of magnitude and also improves performance. Overall, our method has the potential to leverage existing industrial robot manipulation systems to build more flexible and intelligent controllers.
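
The abstract does not specify how base controllers enter the exploration stage or how several of them are synthesized. As a minimal illustrative sketch, and not the authors' algorithm, one plausible scheme is to let the agent follow a base controller with some probability during exploration, and to choose among several base controllers by the critic's value estimate. The names `select_base_action`, `explore_action`, and `base_prob`, as well as the selection rule itself, are assumptions introduced here for illustration.

```python
import random
from typing import Callable, List, Sequence

State = Sequence[float]
Action = Sequence[float]
Controller = Callable[[State], Action]        # hand-designed base controller (hypothetical type)
QFunction = Callable[[State, Action], float]  # critic estimate Q(s, a) (hypothetical type)


def select_base_action(state: State,
                       base_controllers: List[Controller],
                       q_value: QFunction) -> Action:
    """Assumed synthesis rule: query every base controller and keep the
    proposal that the critic scores highest."""
    proposals = [controller(state) for controller in base_controllers]
    return max(proposals, key=lambda action: q_value(state, action))


def explore_action(state: State,
                   policy: Controller,
                   base_controllers: List[Controller],
                   q_value: QFunction,
                   base_prob: float,
                   rng: random.Random) -> Action:
    """Assumed exploration rule: with probability base_prob follow a base
    controller, otherwise follow the learned policy."""
    if rng.random() < base_prob:
        return select_base_action(state, base_controllers, q_value)
    return policy(state)
```

Under these assumptions, setting `base_prob` close to 1 early in training would keep the robot near the behavior of the known controllers, and lowering it over time would hand control to the learned policy.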
