Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent. This is because partial observability gives rise to hidden confounders in the causal graph. We break down the space of confounded imitation learning problems and identify three settings, each with different data requirements, in which the correct imitation policy can be identified. We then introduce an algorithm for deconfounded imitation learning, which trains an inference model jointly with a latent-conditional policy. At test time, the agent alternates between updating its belief over the latent and acting under that belief. We show in theory and practice that this algorithm converges to the correct interventional policy, solves the confounding issue, and can, under certain assumptions, achieve asymptotically optimal imitation performance.
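The test-time loop described above (update the belief over the latent, then act under that belief) can be illustrated with a toy sketch. This is not the paper's implementation: the two-valued latent, the hand-written `policy` and `likelihood` functions, the 0.9/0.1 observation probabilities, and acting by sampling a latent from the belief are all assumptions made for illustration. In the method itself, the inference model and the latent-conditional policy are learned.

```python
import random

LATENTS = (0, 1)  # assumed: a discrete latent with two possible values


def policy(latent):
    """Hypothetical latent-conditional policy: pick the action matching the latent."""
    return latent


def likelihood(obs, action, latent):
    """Hypothetical observation model p(obs | action, latent).

    obs is 1 with probability 0.9 when the action matches the latent, else 0.1.
    """
    p_one = 0.9 if action == latent else 0.1
    return p_one if obs == 1 else 1.0 - p_one


def update_belief(belief, action, obs):
    """Bayes update of the belief over the latent.

    The agent's own action is treated as an intervention: only the observation
    likelihood enters the update, never p(action | latent), because the agent
    chose the action without knowing the latent.
    """
    unnorm = [belief[z] * likelihood(obs, action, z) for z in LATENTS]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def run_episode(true_latent, steps=20, seed=0):
    """Alternate between acting under the current belief and updating it."""
    rng = random.Random(seed)
    belief = [0.5, 0.5]  # uniform prior over the latent
    for _ in range(steps):
        # Act under the belief: sample a latent, then follow the policy for it.
        sampled = rng.choices(LATENTS, weights=belief)[0]
        action = policy(sampled)
        # Simulated environment response, generated from the true latent.
        p_one = 0.9 if action == true_latent else 0.1
        obs = 1 if rng.random() < p_one else 0
        belief = update_belief(belief, action, obs)
    return belief
```

Calling `run_episode(true_latent=1)` returns the final belief as a two-element list of probabilities. Under these toy assumptions, the belief tends to concentrate on the true latent as observations accumulate, so the sampled actions increasingly match what an expert who observes the latent would do.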