In practice, imitation learning is preferred over pure reinforcement learning whenever it is possible to design a teaching agent to provide expert supervision. However, we show that when the teaching agent makes decisions with access to privileged information that is unavailable to the student, this information is marginalized during imitation learning, resulting in an "imitation gap" and, potentially, poor results. Prior work bridges this gap via a progression from imitation learning to reinforcement learning. While often successful, gradual progression fails for tasks that require frequent switches between exploration and memorization. To better address these tasks and alleviate the imitation gap, we propose "Adaptive Insubordination" (ADVISOR). ADVISOR dynamically weights imitation and reward-based reinforcement learning losses during training, enabling on-the-fly switching between imitation and exploration. On a suite of challenging tasks set within gridworlds, multi-agent particle environments, and high-fidelity 3D simulators, we show that on-the-fly switching with ADVISOR outperforms pure imitation learning, pure reinforcement learning, and their sequential and parallel combinations.
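To make the "dynamically weights imitation and reward-based reinforcement learning losses" idea concrete, below is a minimal sketch (not the authors' implementation) of blending an imitation loss with a policy-gradient RL loss via a per-state weight. The function name `combined_loss`, the `temperature` parameter, and the specific weighting rule, using the probability that an auxiliary, non-privileged policy assigns to the expert's action as a proxy for the imitation gap, are illustrative assumptions; the exact weighting scheme follows the paper.

```python
# Minimal sketch, assuming a PyTorch setup with discrete actions.
import torch
import torch.nn.functional as F

def combined_loss(student_logits, aux_logits, expert_actions,
                  log_probs_taken, advantages, temperature=1.0):
    """Per-sample blend of imitation (cross-entropy) and RL (policy-gradient) losses.

    student_logits  : (B, A) student policy logits
    aux_logits      : (B, A) logits of an auxiliary policy trained purely by imitation
    expert_actions  : (B,)   expert (teacher) actions
    log_probs_taken : (B,)   log pi(a_t | s_t) of the actions the student actually took
    advantages      : (B,)   advantage estimates for those actions
    """
    # Imitation loss: match the expert's actions.
    imitation = F.cross_entropy(student_logits, expert_actions, reduction="none")

    # RL loss: vanilla policy-gradient surrogate.
    rl = -(log_probs_taken * advantages)

    # Per-state weight: high where the auxiliary (non-privileged) policy can
    # already reproduce the expert, low where it cannot (large imitation gap),
    # so the RL term takes over there. Using the aux policy's probability of
    # the expert action as this proxy is an assumption of the sketch.
    with torch.no_grad():
        aux_probs = F.softmax(aux_logits / temperature, dim=-1)
        w = aux_probs.gather(1, expert_actions.unsqueeze(1)).squeeze(1)

    return (w * imitation + (1.0 - w) * rl).mean()
```

When the weight is near 1 the update reduces to behavior cloning against the teacher; when it is near 0 the update reduces to a standard policy-gradient step, which is the on-the-fly switching behavior described above.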