Adversarial imitation learning (AIL) has become a popular alternative to supervised imitation learning that reduces the distribution shift suffered by the latter. However, AIL requires effective exploration during an online reinforcement learning phase. In this work, we show that the standard, naive approach to exploration can manifest as a suboptimal local maximum if a policy learned with AIL sufficiently matches the expert distribution without fully learning the desired task. This can be particularly catastrophic for manipulation tasks, where the difference between an expert and a non-expert state-action pair is often subtle. We present Learning from Guided Play (LfGP), a framework in which we leverage expert demonstrations of multiple exploratory, auxiliary tasks in addition to a main task. The addition of these auxiliary tasks forces the agent to explore states and actions that standard AIL may learn to ignore. Additionally, this particular formulation allows for the reusability of expert data between main tasks. Our experimental results in a challenging multitask robotic manipulation domain indicate that LfGP significantly outperforms both AIL and behaviour cloning, while also being more expert-sample efficient than these baselines. To explain this performance gap, we provide further analysis of a toy problem that highlights the coupling between a local maximum and poor exploration, and also visualize the differences between the learned models from AIL and LfGP.
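To make the idea above concrete, the following is a minimal, hypothetical sketch (in PyTorch) of the two ingredients the abstract refers to: a discriminator-based AIL reward, and one discriminator per task combined with a simple scheduler that switches between the main and auxiliary tasks. The network sizes, the softplus reward form, the auxiliary task names, and the uniform scheduler are illustrative assumptions, not the paper's implementation.

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class Discriminator(nn.Module):
    """Classifies (state, action) pairs as expert vs. policy-generated."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))


def ail_reward(disc: Discriminator, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
    # A common AIL surrogate reward, -log(1 - D(s, a)), equal to softplus of the
    # discriminator logits: larger when the pair looks expert-like.
    logits = disc(obs, act)
    return F.softplus(logits)


# Hypothetical multitask setup: one discriminator (and expert buffer) per task.
# Task names and dimensions are placeholders.
task_names = ["main", "reach", "grasp", "lift"]
discriminators = {name: Discriminator(obs_dim=32, act_dim=4) for name in task_names}


def sample_task() -> str:
    # Stand-in uniform scheduler: periodically switching the active task (and
    # its reward) forces the agent to visit states and actions that a
    # single-task AIL learner might otherwise ignore.
    return random.choice(task_names)
```

In this sketch, each interaction segment would use `sample_task()` to pick which task's policy head and discriminator reward to optimize, which is the mechanism the abstract credits for improved exploration and for the reusability of auxiliary-task expert data across main tasks.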