Despite extensive empirical evaluation, a fundamental question in imitation learning remains unsettled: does adversarial imitation learning (AIL) provably generalize better than behavioral cloning (BC)? We study this open problem in tabular, episodic MDPs. For vanilla AIL, which estimates the expert's state-action distribution by direct maximum likelihood, we provide both negative and positive answers in the known-transition setting. For one class of MDPs, we show that vanilla AIL has worse sample complexity than BC. The key insight is that the state-action distribution matching principle is too weak, so AIL may generalize poorly even on states visited in the expert demonstrations. For another class of MDPs, vanilla AIL provably generalizes well even on non-visited states. Interestingly, its sample complexity is horizon-free, which provably beats BC by a wide margin. Finally, we establish a framework for the unknown-transition setting that allows AIL to explore via reward-free exploration strategies. Compared with the best-known online apprenticeship learning algorithm, the resulting algorithm improves both the sample complexity and the interaction complexity.
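To make the two estimation principles contrasted above concrete, here is a minimal, hedged sketch in a tabular, episodic MDP with a known transition kernel: BC fits the empirical conditional policy by maximum likelihood, while vanilla AIL matches the empirical expert state-action occupancy. All names (S, A, H, P, demos) and the crude random-search optimizer are illustrative assumptions, not the paper's algorithms.

```python
# Illustrative sketch only: BC (direct MLE) vs. vanilla AIL (occupancy
# matching) in a tabular, episodic MDP with known transitions.
import numpy as np

S, A, H = 3, 2, 4                                  # states, actions, horizon
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(H, S, A))      # P[h, s, a] -> next-state dist
rho = np.ones(S) / S                               # initial state distribution

def occupancy(pi):
    """State-action occupancy d^pi_h(s, a), computable since P is known."""
    d, ds = np.zeros((H, S, A)), rho.copy()
    for h in range(H):
        d[h] = ds[:, None] * pi[h]                 # d_h(s, a) = d_h(s) * pi_h(a|s)
        ds = np.einsum("sa,sat->t", d[h], P[h])    # propagate to step h + 1
    return d

# Expert demonstrations: m trajectories of (h, s, a) triples from a fixed policy.
expert_pi = np.zeros((H, S, A)); expert_pi[:, :, 0] = 1.0
def sample_trajs(pi, m):
    trajs = []
    for _ in range(m):
        s, traj = rng.choice(S, p=rho), []
        for h in range(H):
            a = rng.choice(A, p=pi[h, s])
            traj.append((h, s, a))
            s = rng.choice(S, p=P[h, s, a])
        trajs.append(traj)
    return trajs
demos = sample_trajs(expert_pi, m=5)

counts = np.zeros((H, S, A))
for traj in demos:
    for h, s, a in traj:
        counts[h, s, a] += 1

# BC: direct maximum likelihood, i.e., the empirical conditional pi(a|s);
# it must fall back to an arbitrary rule (here uniform) on unvisited states.
visits = counts.sum(-1, keepdims=True)
bc_pi = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / A)

# Vanilla AIL principle: pick a policy whose occupancy is close (in total
# variation) to the *empirical* expert occupancy. Random search stands in
# for a real solver purely to make the objective executable.
d_hat = counts / len(demos)                        # empirical expert occupancy
def tv_objective(pi):
    return np.abs(occupancy(pi) - d_hat).sum()

best_pi, best_val = bc_pi, tv_objective(bc_pi)
for _ in range(200):
    cand = rng.dirichlet(np.ones(A), size=(H, S))
    v = tv_objective(cand)
    if v < best_val:
        best_pi, best_val = cand, v
print("TV objective at BC policy:", tv_objective(bc_pi), "| AIL search:", best_val)
```

Note the structural difference this sketch exposes: BC only constrains actions on visited states, whereas the AIL objective couples all states through the known transition kernel, which is where the contrasting generalization behaviors in the abstract arise.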