In recent years, deep reinforcement learning (DRL) has made successful incursions into complex decision-making applications such as robotics, autonomous driving, and video games. In the search for more sample-efficient algorithms, a promising direction is to leverage as much external off-policy data as possible. One staple of this data-driven approach is to learn from expert demonstrations. In the past, multiple ideas have been proposed to make good use of the demonstrations added to the replay buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We present a new method, able to leverage demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm. Our method is based on a reward bonus given to demonstrations and successful episodes, encouraging expert imitation and self-imitation. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. Our experiments focus on robotic manipulation, specifically on three tasks for a 6-degrees-of-freedom robotic arm in simulation. We show that our method based on reward relabeling improves the performance of the base algorithms (SAC and DDPG) on these tasks, even in the absence of demonstrations. Furthermore, integrating two improvements from previous works into our method allows our approach to outperform all baselines.
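For illustration, the sketch below shows one possible reading of the reward-relabeling idea described above: demonstration transitions receive a constant reward bonus before being stored, and online episodes receive the same bonus only if they reach the sparse goal. This is a minimal Python sketch under our own assumptions; the bonus value, the buffer interface, and names such as relabel_episode and store_online_episode are hypothetical and are not taken from the paper's implementation.

```python
# Minimal sketch of reward relabeling for demonstrations and successful episodes.
# All names and the constant bonus are assumptions made for illustration only;
# whether the bonus is added to or replaces the environment reward is an
# implementation choice not specified here.

REWARD_BONUS = 1.0  # assumed constant bonus for demo / successful transitions


def relabel_episode(transitions, bonus=REWARD_BONUS):
    """Add a constant reward bonus to every transition of an episode."""
    return [(s, a, r + bonus, s_next, done)
            for (s, a, r, s_next, done) in transitions]


class ReplayBuffer:
    """Toy FIFO buffer standing in for the off-policy algorithm's replay buffer."""

    def __init__(self, capacity=1_000_000):
        self.storage = []
        self.capacity = capacity

    def add(self, transition):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
        self.storage.append(transition)

    def add_episode(self, transitions):
        for t in transitions:
            self.add(t)


def store_online_episode(buffer, transitions, success):
    """Relabel an online episode with the bonus only if it reached the sparse goal."""
    if success:
        transitions = relabel_episode(transitions)
    buffer.add_episode(transitions)


# Hypothetical usage: demonstrations are relabeled once, before training starts,
# e.g. buffer.add_episode(relabel_episode(demo_transitions)); each episode
# collected online is then stored with store_online_episode(buffer, ep, ep_success).
```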