Imitation learning from observation describes policy learning in a manner similar to human learning: an agent's policy is trained by observing an expert performing a task. While many state-only imitation learning approaches are based on adversarial imitation learning, a main drawback is that adversarial training is often unstable and lacks a reliable convergence estimator. If the true environment reward is unknown and cannot be used to select the best-performing model, this can result in poor real-world policy performance. We propose a non-adversarial learning-from-observation approach, together with an interpretable convergence and performance metric. Our training objective minimizes the Kullback-Leibler divergence (KLD) between the policy and expert state-transition trajectories, which can be optimized in a non-adversarial fashion. Such methods demonstrate improved robustness when learned density models guide the optimization. We further improve sample efficiency by rewriting the KLD minimization as the Soft Actor-Critic objective, based on a modified reward that uses additional density models estimating the environment's forward and backward dynamics. Finally, we evaluate the effectiveness of our approach on well-known continuous control environments and show state-of-the-art performance while providing a reliable performance estimator, compared to several recent learning-from-observation methods.
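To make the stated objective concrete, the following is a minimal sketch of the idea; the notation (state-transition distributions $\rho^{\pi}$, $\rho^{E}$, dynamics densities $p^{E}$, $p^{\pi}$, and the surrogate reward form) is assumed here for illustration and may differ from the exact derivation in the paper.

% Hedged sketch: minimize the KLD between policy- and expert-induced
% state-transition distributions (notation assumed for illustration).
\begin{align}
  \min_{\pi}\; D_{\mathrm{KL}}\!\left( \rho^{\pi}(s_t, s_{t+1}) \,\middle\|\, \rho^{E}(s_t, s_{t+1}) \right)
\end{align}
% One plausible surrogate reward built from learned density models,
% which could then be maximized with the Soft Actor-Critic objective;
% the policy-induced transition density would be estimated with the
% learned forward and backward dynamics models mentioned above.
\begin{align}
  \tilde{r}(s_t, a_t, s_{t+1}) \;=\; \log p^{E}(s_{t+1} \mid s_t) \;-\; \log p^{\pi}(s_{t+1} \mid s_t)
\end{align}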