Learning from Guided Play: A Scheduled Hierarchical Approach for Improving Exploration in Adversarial Imitation Learning

Effective exploration continues to be a significant challenge that prevents the deployment of reinforcement learning for many physical systems. This is particularly true for systems with continuous and high-dimensional state and action spaces, such as robotic manipulators. The challenge is accentuated in the sparse rewards setting, where the low-level state information required for the design of dense rewards is unavailable. Adversarial imitation learning (AIL) can partially overcome this barrier by leveraging expert-generated demonstrations of optimal behaviour and providing, essentially, a replacement for dense reward information. Unfortunately, the availability of expert demonstrations does not necessarily improve an agent's capability to explore effectively and, as we empirically show, can lead to inefficient or stagnated learning. We present Learning from Guided Play (LfGP), a framework in which we leverage expert demonstrations of, in addition to a main task, multiple auxiliary tasks. Subsequently, a hierarchical model is used to learn each task reward and policy through a modified AIL procedure, in which exploration of all tasks is enforced via a scheduler composing different tasks together. This affords many benefits: learning efficiency is improved for main tasks with challenging bottleneck transitions, expert data becomes reusable between tasks, and transfer learning through the reuse of learned auxiliary task models becomes possible. Our experimental results in a challenging multitask robotic manipulation domain indicate that our method compares favourably to supervised imitation learning and to a state-of-the-art AIL method. Code is available at https://github.com/utiasSTARS/lfgp.
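To make the scheduling idea concrete, the following is a minimal sketch of a scheduler that decides which task's policy acts at each timestep of an episode. It is not taken from the LfGP codebase: the function names, the task labels, the fixed switching period, and the uniform random choice are all assumptions made for illustration only.

```python
import random
from typing import Dict, List, Sequence


def schedule_episode(
    tasks: Sequence[str],
    episode_len: int,
    period: int,
    rng: random.Random,
) -> List[str]:
    """Return the task whose policy would act at each timestep of one episode.

    Assumed rule (illustrative only): a task is drawn uniformly at random
    every `period` steps and stays active until the next draw.
    """
    if not tasks:
        raise ValueError("at least one task is required")
    if period <= 0:
        raise ValueError("period must be positive")

    active: List[str] = []
    current = tasks[0]
    for t in range(episode_len):
        if t % period == 0:
            # Start of a new scheduler period: pick the next task to execute.
            current = rng.choice(tasks)
        active.append(current)
    return active


def count_steps(active: Sequence[str]) -> Dict[str, int]:
    """Count how many timesteps each task was active in a schedule."""
    counts: Dict[str, int] = {}
    for task in active:
        counts[task] = counts.get(task, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical task labels: one main task plus two auxiliary tasks.
    example_tasks = ["main", "aux_a", "aux_b"]
    plan = schedule_episode(example_tasks, episode_len=12, period=4,
                            rng=random.Random(0))
    print(plan)
    print(count_steps(plan))
```

Because the active task changes within an episode, data is collected under a composition of task policies rather than under the main-task policy alone, which is the kind of enforced exploration the abstract describes. A learned or hand-weighted scheduler could replace the uniform draw without changing the interface.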
