We propose a new framework for imitation learning that treats imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to satisfy pairwise performance rankings between behaviors, while the policy agent learns to maximize this reward. In imitation learning, near-optimal expert data can be difficult to obtain, and even in the limit of infinite data it cannot induce a total ordering over trajectories the way preferences can. On the other hand, learning from preferences alone is challenging, since a large number of preferences is required to infer a high-dimensional reward function, even though preference data is typically much easier to collect than expert demonstrations. The classical inverse reinforcement learning (IRL) formulation learns from expert demonstrations but provides no mechanism for incorporating offline preferences, and vice versa. We instantiate the proposed ranking-game framework with a novel ranking loss, yielding an algorithm that can simultaneously learn from expert demonstrations and preferences and thereby gains the advantages of both modalities. Our experiments show that the proposed method achieves state-of-the-art sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting.
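To make the two-player structure concrete, the sketch below (not the paper's implementation) shows the reward player being updated with a pairwise ranking loss over trajectory pairs, followed by the relabeling step that the policy player would maximize with an off-the-shelf RL algorithm. The network sizes, the Bradley-Terry-style ranking loss, the toy trajectories, and all variable names are assumptions for illustration only.

```python
# Minimal sketch of the ranking game, assuming a Bradley-Terry-style pairwise
# loss for the reward player and a standard RL step for the policy player.
import torch
import torch.nn as nn

obs_dim, act_dim, horizon = 4, 2, 16  # hypothetical toy dimensions

# Reward player: maps (observation, action) features to a scalar reward.
reward_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
reward_opt = torch.optim.Adam(reward_net.parameters(), lr=3e-4)

def trajectory_return(traj):
    """Sum of predicted per-step rewards for a trajectory of (obs, act) features."""
    return reward_net(traj).sum()

def ranking_loss(traj_lo, traj_hi):
    """Pairwise ranking loss: traj_hi should be ranked above traj_lo.

    Rankings can come from expert-vs-agent comparisons or from offline
    preference data; here we use a cross-entropy over the two returns.
    """
    returns = torch.stack([trajectory_return(traj_lo), trajectory_return(traj_hi)])
    return -torch.log_softmax(returns, dim=0)[1]

# --- Reward player step: satisfy a pairwise ranking -------------------------
# Toy stand-ins for a lower-ranked (agent) and higher-ranked (expert) trajectory.
agent_traj = torch.randn(horizon, obs_dim + act_dim)
expert_traj = torch.randn(horizon, obs_dim + act_dim)

reward_opt.zero_grad()
ranking_loss(agent_traj, expert_traj).backward()
reward_opt.step()

# --- Policy player step: maximize the learned reward ------------------------
# In the full algorithm the policy is trained with RL (e.g., SAC or PPO) on
# rewards relabeled by reward_net; only the relabeling step is shown here.
with torch.no_grad():
    relabeled_rewards = reward_net(agent_traj).squeeze(-1)  # per-step rewards for the RL update
```

Alternating these two updates is what makes the formulation a game: the reward is repeatedly reshaped to respect the available rankings, while the policy chases the current reward.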