We propose a new framework for imitation learning: treating imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to satisfy pairwise performance rankings between behaviors, while the policy agent learns to maximize this reward. In imitation learning, near-optimal expert data can be difficult to obtain, and even in the limit of infinite data it cannot imply a total ordering over trajectories, as preferences can. On the other hand, learning from preferences alone is challenging, since a large number of preferences is required to infer a high-dimensional reward function, even though preference data is typically much easier to collect than expert demonstrations. The classical inverse reinforcement learning (IRL) formulation learns from expert demonstrations but provides no mechanism to incorporate offline preferences, and vice versa. We instantiate the proposed ranking-game framework with a novel ranking loss, giving an algorithm that can simultaneously learn from expert demonstrations and preferences, gaining the advantages of both modalities. Our experiments show that the proposed method achieves state-of-the-art sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting. Project video and code can be found at https://hari-sikchi.github.io/rank-game/
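To make the reward agent's objective concrete, the following is a minimal toy sketch of learning a reward from pairwise rankings. It is an illustrative assumption, not the paper's exact loss: behaviors are represented as feature vectors, the reward model is linear, and a standard Bradley-Terry-style logistic ranking surrogate stands in for the paper's novel ranking loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumption): each "behavior" is a feature vector whose true
# quality is a linear function of the features.
true_w = np.array([1.0, -2.0, 0.5])
behaviors = rng.normal(size=(200, 3))
quality = behaviors @ true_w

# Pairwise rankings: (lo, hi) index pairs with quality[lo] < quality[hi].
pairs = []
for _ in range(500):
    i, j = rng.integers(0, len(behaviors), size=2)
    if quality[i] == quality[j]:
        continue
    pairs.append((i, j) if quality[i] < quality[j] else (j, i))
pairs = np.array(pairs)

# Reward agent: train a linear reward model so that ranked-higher behaviors
# receive larger reward, via the logistic surrogate
#   L = log(1 + exp(r(lo) - r(hi))).
w = np.zeros(3)
lr = 0.1
for _ in range(200):
    r = behaviors @ w
    diff = r[pairs[:, 0]] - r[pairs[:, 1]]   # r(lo) - r(hi)
    sig = 1.0 / (1.0 + np.exp(-diff))        # dL/d(diff)
    grad = (sig[:, None]
            * (behaviors[pairs[:, 0]] - behaviors[pairs[:, 1]])).mean(axis=0)
    w -= lr * grad

# The learned reward should now satisfy most of the training rankings;
# a policy agent would then be trained to maximize this reward.
r = behaviors @ w
accuracy = np.mean(r[pairs[:, 0]] < r[pairs[:, 1]])
print(f"ranking accuracy: {accuracy:.2f}")
```

Because rankings only constrain reward *differences*, demonstrations and offline preferences can both be converted into such pairs, which is what lets the full method combine the two data modalities.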