Applying Deep Reinforcement Learning (DRL) algorithms to robotic tasks faces many challenges. On the one hand, reward shaping for complex tasks that involve multiple stages is difficult and may result in sub-optimal performance. On the other hand, a sparse-reward setting renders exploration inefficient, and exploration with physical robots is costly and unsafe. In this paper we propose a method for learning long-horizon sparse-reward tasks by utilizing one or more existing controllers. Built upon Deep Deterministic Policy Gradient (DDPG), our algorithm incorporates the controllers into the stages of exploration, policy update, and, most importantly, learning a heuristic value function that naturally interpolates along task trajectories. Through experiments ranging from stacking blocks to stacking cups, we present a straightforward way of synthesizing these controllers and show that the learned state-based or image-based policies steadily outperform them. Compared to previous work on learning from demonstrations, our method improves sample efficiency by orders of magnitude. Overall, our method has the potential to leverage existing industrial robot manipulation systems to build more flexible and intelligent controllers.
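To make the idea of incorporating an existing controller into DDPG-style exploration concrete, the following is a minimal sketch, not the authors' implementation: with some probability a rollout action is taken from a hand-designed controller rather than the learned actor, and all transitions enter a shared replay buffer. The environment, controller, actor, and the mixing probability p_ctrl below are hypothetical stand-ins introduced only for illustration.

import random
import numpy as np

def scripted_controller(obs):
    # Hypothetical existing controller, e.g. a scripted reach-and-place policy.
    return np.clip(-obs[:2], -1.0, 1.0)

def actor(obs):
    # Stand-in for the learned DDPG actor network.
    return np.tanh(obs[:2])

def env_step(obs, act):
    # Stand-in for a sparse-reward environment step.
    next_obs = obs + 0.1 * np.concatenate([act, np.zeros(len(obs) - len(act))])
    reward = float(np.linalg.norm(next_obs[:2]) < 0.05)  # sparse success reward
    done = reward > 0
    return next_obs, reward, done

replay_buffer = []
p_ctrl = 0.3  # fraction of steps driven by the existing controller (assumed value)
obs = np.array([0.8, -0.5, 0.0, 0.0])

for t in range(200):
    use_controller = random.random() < p_ctrl
    act = scripted_controller(obs) if use_controller else actor(obs)
    act = act + 0.05 * np.random.randn(*act.shape)  # exploration noise
    next_obs, reward, done = env_step(obs, act)
    replay_buffer.append((obs, act, reward, next_obs, done))
    obs = np.array([0.8, -0.5, 0.0, 0.0]) if done else next_obs
# The collected replay_buffer would then feed the usual DDPG critic/actor updates.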