In the search for more sample-efficient reinforcement-learning (RL) algorithms, a promising direction is to leverage as much external off-policy data as possible, such as expert demonstrations. Multiple ideas have previously been proposed to make good use of demonstrations added to the replay buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We present a new method that can leverage both demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm. Our method is based on a reward bonus given to demonstrations and successful episodes (via relabeling), encouraging expert imitation and self-imitation. Our experiments focus on several robotic-manipulation tasks across two different simulation environments. We show that our method based on reward relabeling improves the performance of the base algorithms (SAC and DDPG) on these tasks. Finally, our best algorithm STIR$^2$ (Self and Teacher Imitation by Reward Relabeling), which integrates multiple improvements from previous works into our method, is more data-efficient than all baselines.
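To make the core idea concrete, the sketch below illustrates reward relabeling on a simple list-of-transitions replay buffer: transitions coming from expert demonstrations or from successful online episodes receive a reward bonus, while all other rewards are left unchanged. This is only a minimal illustration of the idea stated above; the names (`Transition`, `relabel_rewards`, `bonus`) and the flat-buffer layout are assumptions, not the paper's actual implementation.

```python
# Minimal sketch of reward relabeling (illustrative names, not the paper's code).
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Transition:
    state: list
    action: list
    reward: float
    next_state: list
    done: bool
    is_demo: bool          # transition comes from an expert demonstration
    episode_success: bool  # the episode that produced it reached the sparse goal


def relabel_rewards(transitions: List[Transition], bonus: float = 1.0) -> List[Transition]:
    """Add a reward bonus to demonstration transitions and to transitions from
    successful online episodes; leave all other rewards unchanged."""
    return [
        replace(t, reward=t.reward + bonus) if (t.is_demo or t.episode_success) else t
        for t in transitions
    ]
```

In this sketch the relabeled buffer would then be sampled as usual by the off-policy learner (e.g. SAC or DDPG), so the bonus simply biases the value estimates toward imitating expert and previously successful behavior.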