Recent methods for imitation learning directly learn a $Q$-function using an implicit reward formulation rather than an explicit reward function. However, these methods generally require implicit reward regularization to improve stability and often mistreat absorbing states. Previous works show that a squared norm regularization on the implicit reward function is effective, but do not provide a theoretical analysis of the resulting properties of the algorithms. In this work, we show that using this regularizer under a mixture distribution of the policy and the expert provides a particularly illuminating perspective: the original objective can be understood as squared Bellman error minimization, and the corresponding optimization problem minimizes a bounded $\chi^2$-divergence between the expert and the mixture distribution. This perspective allows us to address instabilities and properly treat absorbing states. We show that our method, Least Squares Inverse Q-Learning (LS-IQ), outperforms state-of-the-art algorithms, particularly in environments with absorbing states. Finally, we propose to use an inverse dynamics model to learn from observations only. Using this approach, we retain performance in settings where no expert actions are available.
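To make the stated connection concrete, the following is a minimal sketch of a squared-norm-regularized implicit-reward objective of the kind described above; the notation ($r_Q$, $\rho_E$, $\rho_\pi$, $\rho_{\mathrm{mix}}$, $\alpha$, $p_0$) is illustrative and not necessarily the paper's exact formulation:
\begin{align}
  r_Q(s,a) &= Q(s,a) - \gamma\,\mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[V^{\pi}(s')\right],
  \qquad
  \rho_{\mathrm{mix}} = \tfrac{1}{2}\left(\rho_E + \rho_\pi\right),\\
  \mathcal{J}(Q) &= \mathbb{E}_{\rho_E}\!\left[r_Q(s,a)\right]
  \;-\; (1-\gamma)\,\mathbb{E}_{s_0 \sim p_0}\!\left[V^{\pi}(s_0)\right]
  \;-\; \alpha\,\mathbb{E}_{\rho_{\mathrm{mix}}}\!\left[r_Q(s,a)^2\right].
\end{align}
Under this sketch, completing the square in $r_Q$ turns $\mathcal{J}(Q)$ into a least-squares (squared Bellman error) form, and since the density ratio satisfies $\rho_E/\rho_{\mathrm{mix}} \le 2$, the associated divergence $\chi^2(\rho_E \,\|\, \rho_{\mathrm{mix}}) = \mathbb{E}_{\rho_{\mathrm{mix}}}\!\left[\left(\rho_E/\rho_{\mathrm{mix}} - 1\right)^2\right]$ is bounded, which is the boundedness property referred to above.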