Control policies from imitation learning can often fail to generalize to novel environments due to imperfect demonstrations or the inability of imitation learning algorithms to accurately infer the expert's policies. In this paper, we present rigorous generalization guarantees for imitation learning by leveraging the Probably Approximately Correct (PAC)-Bayes framework to provide upper bounds on the expected cost of policies in novel environments. We propose a two-stage training method where a latent policy distribution is first embedded with multi-modal expert behavior using a conditional variational autoencoder, and then "fine-tuned" in new training environments to explicitly optimize the generalization bound. We demonstrate strong generalization bounds and their tightness relative to empirical performance in simulation for (i) grasping diverse mugs, (ii) planar pushing with visual feedback, and (iii) vision-based indoor navigation, as well as through hardware experiments for the two manipulation tasks.
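For context, generalization guarantees of this kind typically take the standard PAC-Bayes (McAllester-style) form; the following is a minimal sketch of that generic inequality, not necessarily the exact bound optimized in this work, assuming costs bounded in $[0,1]$, $N$ i.i.d. training environments drawn from an unknown distribution $\mathcal{D}$, a prior policy distribution $P_0$ fixed before seeing the training environments, and a posterior $P$. With probability at least $1-\delta$ over the draw of $E_1,\dots,E_N \sim \mathcal{D}$,
\[
\mathbb{E}_{E\sim\mathcal{D}}\,\mathbb{E}_{\pi\sim P}\big[C(\pi;E)\big] \;\le\; \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\pi\sim P}\big[C(\pi;E_i)\big] \;+\; \sqrt{\frac{\mathrm{KL}(P\,\|\,P_0) + \ln\frac{2\sqrt{N}}{\delta}}{2N}},
\]
where $C(\pi;E)$ denotes the cost of policy $\pi$ in environment $E$ and $\mathrm{KL}(P\,\|\,P_0)$ is the Kullback-Leibler divergence between the posterior and the prior. In the two-stage method described above, the CVAE-trained latent policy distribution would play the role of the prior $P_0$, and fine-tuning adjusts $P$ to trade off empirical cost against the divergence term.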