We consider imitation learning problems where the expert has access to a per-episode context that is hidden from the learner, both in the demonstrations and at test time. While the learner might not be able to accurately reproduce expert behavior early in an episode, by considering the entire history of states and actions it may eventually be able to identify the context and act as the expert would. We prove that on-policy imitation learning algorithms (with or without access to a queryable expert) are better equipped to handle these sorts of asymptotically realizable problems than off-policy methods, and are able to avoid the latching behavior (naive repetition of past actions) that plagues the latter. We conduct experiments in a toy bandit domain showing that there are sharp phase transitions in whether off-policy approaches are able to match expert performance asymptotically, in contrast to the uniformly good performance of on-policy approaches. We demonstrate that on several continuous control tasks, on-policy approaches are able to use history to identify the context, while off-policy approaches actually perform worse when given access to history.
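To make the latching failure mode concrete, the following is a minimal sketch (an assumed setup, not the paper's code) of a two-armed bandit with a hidden per-episode context that selects the rewarding arm. In expert demonstrations the previous action always equals the current one, so a history-conditioned policy cloned off-policy degenerates into "repeat the last action"; a learner that receives on-policy corrections (here, its own per-step reward stands in for a queryable expert) instead identifies the context after one step. The policy names and horizon are illustrative choices.

```python
# Minimal illustrative sketch of hidden-context latching (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)
HORIZON, EPISODES = 10, 2000

def run_episode(policy):
    """Roll out one episode; return the fraction of correct arm pulls."""
    context = rng.integers(2)          # hidden per-episode context: which arm pays off
    last_action, last_reward = None, None
    correct = 0
    for t in range(HORIZON):
        action = policy(t, last_action, last_reward)
        reward = int(action == context)
        correct += reward
        last_action, last_reward = action, reward
    return correct / HORIZON

def bc_policy(t, last_action, last_reward):
    # Behavior cloning on expert demos: the previous action perfectly predicts
    # the next one, so the cloned rule latches onto its own first guess.
    if last_action is None:
        return rng.integers(2)         # uninformed first guess
    return last_action                 # latching: naive repetition

def on_policy(t, last_action, last_reward):
    # On-policy-style learner: per-step feedback (a stand-in for an expert
    # correction) reveals the context, so it switches to the rewarded arm.
    if last_action is None:
        return rng.integers(2)
    return last_action if last_reward == 1 else 1 - last_action

for name, pi in [("behavior cloning (latching)", bc_policy),
                 ("on-policy / corrected", on_policy)]:
    score = np.mean([run_episode(pi) for _ in range(EPISODES)])
    print(f"{name:30s} mean per-step accuracy ~ {score:.2f}")
```

Under this toy setup the cloned policy matches the expert only on the roughly half of episodes where its first guess happens to be right, while the corrected learner approaches expert performance after a single step, mirroring the asymptotic-realizability argument above.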