In many sequential decision-making problems (e.g., robotics control, game playing, sequential prediction), human or expert data is available containing useful information about the task. However, imitation learning (IL) from a small amount of expert data can be challenging in high-dimensional environments with complex dynamics. Behavioral cloning is widely used due to its simplicity of implementation and stable convergence, but it does not utilize any information about the environment's dynamics. Many existing methods that do exploit dynamics information are difficult to train in practice, due to an adversarial optimization process over reward and policy approximators or to biased, high-variance gradient estimators. We introduce a method for dynamics-aware IL that avoids adversarial training by learning a single Q-function which implicitly represents both reward and policy. On standard benchmarks, the implicitly learned rewards show a high positive correlation with the ground-truth rewards, illustrating that our method can also be used for inverse reinforcement learning (IRL). Our method, Inverse soft-Q Learning (IQ-Learn), obtains state-of-the-art results in both offline and online imitation learning settings, surpassing existing methods both in the number of required environment interactions and in scalability to high-dimensional spaces.
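To make the claim that a single Q-function can represent both reward and policy concrete, the following is a minimal sketch, not the paper's implementation: it illustrates the standard maximum-entropy RL correspondence in a discrete-action, tabular setting (the function names, toy Q-values, and temperature alpha are our assumptions). The policy is the softmax of Q, the soft value is a log-sum-exp over actions, and a reward is recovered from Q via the soft Bellman relation r(s,a) = Q(s,a) - gamma * E[V(s')].

# Hedged sketch (our notation, not the authors' code): how one Q-function
# over a discrete action space implicitly defines a policy and a reward.
import numpy as np

def soft_value(q_row, alpha=1.0):
    # Soft value: V(s) = alpha * log sum_a exp(Q(s, a) / alpha)
    return alpha * np.log(np.sum(np.exp(q_row / alpha)))

def policy_from_q(q_row, alpha=1.0):
    # Policy: pi(a | s) proportional to exp(Q(s, a) / alpha)
    logits = q_row / alpha
    logits = logits - logits.max()   # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def reward_from_q(q_sa, v_next, gamma=0.99):
    # Soft Bellman relation: r(s, a) = Q(s, a) - gamma * E[V(s')]
    return q_sa - gamma * v_next

# Toy usage with a hypothetical 3-action Q table for a state s and successor s'.
q_s = np.array([1.0, 0.5, -0.2])
q_s_next = np.array([0.3, 0.1, 0.0])
pi = policy_from_q(q_s)                              # implied policy at s
r = reward_from_q(q_s[0], soft_value(q_s_next))      # implied reward for action 0
print(pi, r)

In this view, optimizing the Q-function alone is enough to commit to both a policy (its softmax) and a reward (its soft Bellman residual), which is why no separate adversarial reward model is needed.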