Imitation Learning (IL) is an effective learning paradigm that exploits the interactions between agents and environments. It does not require explicit reward signals and instead tries to recover desired policies from expert demonstrations. In general, IL methods can be categorized into Behavioral Cloning (BC) and Inverse Reinforcement Learning (IRL). In this work, we propose a novel reward function for IRL based on probability density estimation, which significantly reduces the complexity of existing IRL methods. Furthermore, we prove that the theoretically optimal policy derived from our reward function is identical to the expert policy, provided the expert policy is deterministic. Consequently, an IRL problem can be gracefully transformed into a probability density estimation problem. Based on the proposed reward function, we present a "watch-try-learn" style framework named Probability Density Estimation based Imitation Learning (PDEIL), which works in both discrete and continuous action spaces. Finally, comprehensive experiments in the Gym environment show that PDEIL is much more efficient than existing algorithms at recovering rewards close to the ground truth.
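The abstract does not spell out the exact form of the density-based reward, so the following is only a minimal sketch of the underlying idea, assuming the reward of a state-action pair is (a monotone function of) its estimated probability density under the expert demonstrations; the class name `DensityReward` and the use of scikit-learn's `KernelDensity` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


class DensityReward:
    """Illustrative reward built from a density estimate of expert data.

    Assumption: pairs the expert visits often receive high reward, so an
    RL agent maximizing this reward is pushed toward the expert policy.
    """

    def __init__(self, expert_states, expert_actions, bandwidth=0.2):
        # Fit a kernel density estimator on concatenated (state, action) pairs.
        expert_sa = np.hstack([expert_states, expert_actions])
        self.kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
        self.kde.fit(expert_sa)

    def __call__(self, state, action):
        # Score a single (state, action) pair; higher density -> higher reward.
        sa = np.hstack([state, action]).reshape(1, -1)
        log_density = self.kde.score_samples(sa)[0]
        return np.exp(log_density)


# Usage sketch: the learned reward could be plugged into any standard RL loop.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_states = rng.normal(size=(500, 4))    # placeholder observations
    demo_actions = rng.normal(size=(500, 1))   # placeholder continuous actions
    reward_fn = DensityReward(demo_states, demo_actions)
    print(reward_fn(demo_states[0], demo_actions[0]))
```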