Applying Deep Reinforcement Learning (DRL) algorithms to real-world robotic tasks faces many challenges. On the one hand, reward shaping for complex tasks is difficult and may result in sub-optimal performance. On the other hand, a sparse-reward setting renders exploration inefficient, and exploration with physical robots is costly and unsafe. In this paper we propose a method for learning challenging sparse-reward tasks by utilizing existing controllers. Built upon Deep Deterministic Policy Gradients (DDPG), our algorithm incorporates the controllers into the stages of exploration, Q-value estimation, and policy update. Through experiments ranging from stacking blocks to stacking cups, we present a straightforward way of synthesizing these controllers, and show that the learned state-based or image-based policies steadily outperform them. Compared to previous work on learning from demonstrations, our method improves sample efficiency by orders of magnitude and can learn online in a safe manner. Overall, our method has the potential to leverage existing industrial robot manipulation systems to build more flexible and intelligent controllers.
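To make the idea of incorporating an existing controller into DDPG exploration concrete, the following is a minimal illustrative sketch. It is not the paper's exact mechanism: the class name `ControllerGuidedExploration`, the mixing probability `beta`, and the Gaussian-noise scheme are assumptions introduced only for illustration; the paper additionally uses the controllers in Q-value estimation and policy updates, which are not shown here.

```python
import numpy as np

class ControllerGuidedExploration:
    """Illustrative action selection that mixes a hand-designed controller
    with a DDPG actor during exploration. Names and the mixing rule are
    hypothetical and stand in for the paper's actual scheme."""

    def __init__(self, actor, controller, beta=0.5, noise_std=0.1):
        self.actor = actor            # learned policy: state -> action
        self.controller = controller  # existing controller: state -> action
        self.beta = beta              # probability of following the controller
        self.noise_std = noise_std    # Gaussian exploration noise on the actor

    def act(self, state):
        # With probability beta, follow the existing controller so that
        # exploration stays near feasible, safe trajectories and reaches
        # sparse rewards more often.
        if np.random.rand() < self.beta:
            return self.controller(state)
        # Otherwise take the actor's action perturbed by Gaussian noise.
        action = np.asarray(self.actor(state))
        return action + self.noise_std * np.random.randn(*action.shape)
```

In practice, `beta` could be annealed over training so that the learned policy gradually takes over from the controller once it starts to outperform it.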