One of the common ways children learn is by mimicking adults. Imitation learning focuses on learning policies with suitable performance from demonstrations generated by an expert, when the performance measure is unspecified and the reward signal is unobserved. Popular methods for imitation learning either directly mimic the behavior policy of an expert (behavior cloning) or learn a reward function that prioritizes observed expert trajectories (inverse reinforcement learning). However, these methods rely on the assumption that the covariates used by the expert to determine their actions are fully observed. In this paper, we relax this assumption and study imitation learning when the sensory inputs of the learner and the expert differ. First, we provide a non-parametric, graphical criterion that is complete (both necessary and sufficient) for determining the feasibility of imitation from the combination of demonstration data and qualitative assumptions about the underlying environment, represented in the form of a causal model. We then show that when this criterion does not hold, imitation may still be feasible by exploiting quantitative knowledge of the expert trajectories. Finally, we develop an efficient procedure for learning the imitating policy from the expert's trajectories.
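To make the fully-observed-covariates assumption concrete, the following is a minimal sketch of how behavior cloning can fail when the expert acts on a covariate the learner cannot see. The toy environment is an assumption made purely for illustration and is not taken from the paper: a fair binary covariate `U` is visible only to the expert, the action is `X`, and the reward is 1 exactly when `X` equals `U`. All names (`P_U`, `reward`, `expert_policy`, `clone_marginal`) are hypothetical.

```python
from itertools import product

# Hypothetical toy environment (illustrative assumption, not from the paper):
# U is a fair binary covariate observed by the expert only.
P_U = {0: 0.5, 1: 0.5}
ACTIONS = (0, 1)


def reward(x, u):
    """Reward is 1 when the action matches the hidden covariate, else 0."""
    return 1.0 if x == u else 0.0


def expert_policy(u):
    """Expert sees U and matches it; returns P(X = x | U = u) as a dict."""
    return {x: (1.0 if x == u else 0.0) for x in ACTIONS}


def expected_reward(policy):
    """Exact expected reward of a policy mapping u to a distribution over x."""
    return sum(
        P_U[u] * policy(u)[x] * reward(x, u)
        for u, x in product(P_U, ACTIONS)
    )


def clone_marginal(policy):
    """Behavior cloning without access to U: copy the marginal P(X = x)."""
    marginal = {x: sum(P_U[u] * policy(u)[x] for u in P_U) for x in ACTIONS}
    # The learner cannot condition on U, so its policy ignores the argument.
    return lambda u: marginal


expert_value = expected_reward(expert_policy)
cloned_value = expected_reward(clone_marginal(expert_policy))
print(f"expert: {expert_value:.2f}, cloned: {cloned_value:.2f}")
```

In this sketch the cloned policy reproduces the expert's action distribution exactly, yet its expected reward is lower because its actions are no longer correlated with `U`. This is only one simple instance of the mismatch between learner and expert inputs; it does not implement the graphical criterion or the learning procedure described above.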