We develop algorithms for imitation learning from policy data that was corrupted by temporally correlated noise in expert actions. When noise affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch on to, leading to poor policy performance. To break up these spurious correlations, we apply modern variants of the instrumental variable regression (IVR) technique of econometrics, enabling us to recover the underlying policy without requiring access to an interactive expert. In particular, we present two techniques, one of a generative-modeling flavor (DoubIL) that can utilize access to a simulator, and one of a game-theoretic flavor (ResiduIL) that can be run entirely offline. We find both of our algorithms compare favorably to behavioral cloning on simulated control tasks.
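To make the core econometric idea concrete, below is a minimal, self-contained sketch of instrumental variable regression via classical two-stage least squares (2SLS), the technique family the abstract builds on. This is an illustrative toy, not the paper's DoubIL or ResiduIL algorithms; the linear-Gaussian setup and the variable names (state x, action y, instrument z, confounder u) are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Confounder u corrupts both the observed state x and the recorded action y,
# mimicking temporally correlated noise that creates a spurious x-y link.
u = rng.normal(size=n)
z = rng.normal(size=n)               # instrument: correlated with x, independent of u
x = 0.8 * z + u + 0.1 * rng.normal(size=n)
true_coef = 1.5                      # the underlying policy coefficient we want to recover
y = true_coef * x + u + 0.1 * rng.normal(size=n)

def ols_slope(a, b):
    """Ordinary least squares slope of b regressed on a (with intercept)."""
    design = np.column_stack([np.ones_like(a), a])
    return np.linalg.lstsq(design, b, rcond=None)[0][1]

# Naive regression (analogous to behavioral cloning) is biased by the confounder.
print("OLS estimate: ", ols_slope(x, y))      # noticeably larger than 1.5

# 2SLS: stage 1 projects x onto the instrument z; stage 2 regresses y on that
# projection, which removes the confounded part of x.
x_hat = ols_slope(z, x) * z
print("2SLS estimate:", ols_slope(x_hat, y))  # close to 1.5
```

In the imitation-learning setting of the abstract, the analogue is to use past observations as instruments for the current state, so that the learner recovers the expert's true state-to-action mapping rather than the spurious correlations induced by correlated action noise.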