Reinforcement learning often suffers from the sparse reward problem in real-world robotics tasks. Learning from demonstration (LfD) is an effective way to mitigate this problem by leveraging collected expert data to aid online learning. Prior work often assumes that the learning agent and the expert aim to accomplish the same task, which requires collecting new data for every new task. In this paper, we consider the case where the target task is mismatched with, but similar to, the expert's task. Such a setting is challenging, and we find that existing LfD methods cannot effectively guide learning in mismatched new tasks with sparse rewards. We propose conservative reward shaping from demonstration (CRSfD), which shapes the sparse rewards using an estimated expert value function. To accelerate the learning process, CRSfD guides the agent to explore conservatively around the demonstrations. Experimental results on robot manipulation tasks show that our approach outperforms baseline LfD methods when transferring demonstrations collected in a single task to other different but similar tasks.
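To make the core idea concrete, the sketch below illustrates one common way a sparse reward can be shaped with an estimated expert value function, assuming standard potential-based reward shaping; the function and variable names (expert_value, shaped_reward, gamma) are illustrative placeholders, not the paper's actual implementation.

```python
import numpy as np

def shaped_reward(r_sparse, s, s_next, expert_value, gamma=0.99):
    """Augment a sparse environment reward with a potential-based term
    derived from an estimated expert value function (illustrative only)."""
    return r_sparse + gamma * expert_value(s_next) - expert_value(s)

# Toy usage: a hypothetical value estimate based on distance to a goal state.
goal = np.array([1.0, 0.0])
expert_value = lambda s: -np.linalg.norm(np.asarray(s) - goal)

r = shaped_reward(0.0, s=[0.5, 0.0], s_next=[0.6, 0.0], expert_value=expert_value)
print(r)  # positive shaping signal for moving toward the goal despite zero sparse reward
```

Because the shaping term is potential-based, it densifies the reward signal without changing the optimal policy of the original sparse-reward task under the usual assumptions.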