We propose a new framework for imitation learning: treating imitation as a two-player ranking-based Stackelberg game between a $\textit{policy}$ and a $\textit{reward}$ function. In this game, the reward agent learns to satisfy pairwise performance rankings within a set of policies, while the policy agent learns to maximize this reward. This game encompasses a large subset of both inverse reinforcement learning (IRL) methods and methods that learn from offline preferences. The Stackelberg game formulation allows us to use optimization methods that take the game structure into account, leading to more sample-efficient and stable learning dynamics than existing IRL methods. We theoretically analyze the requirements on the ranking loss function that allow near-optimal imitation learning at equilibrium. We use insights from this analysis to further increase the sample efficiency of the ranking game by augmenting it with automatically generated rankings or with offline annotated rankings. Our experiments show that the proposed method achieves state-of-the-art sample efficiency and is able to solve previously unsolvable tasks in the Learning from Observation (LfO) setting.
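To make the reward agent's objective concrete, the sketch below shows one common way to realize a pairwise ranking loss over trajectory returns (a Bradley-Terry style cross-entropy on cumulative predicted reward). It is a minimal illustration only: the network architecture, function names, and the specific choice of ranking loss are assumptions for exposition, not the exact loss analyzed in the paper.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small MLP reward model r_theta(s, a); architecture is illustrative."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: (T, obs_dim), act: (T, act_dim) -> per-step rewards (T,)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def pairwise_ranking_loss(reward_net, traj_lo, traj_hi):
    """Bradley-Terry style loss: the higher-ranked trajectory `traj_hi`
    should obtain a larger cumulative predicted reward than `traj_lo`.
    Each trajectory is a (obs, act) pair of tensors."""
    ret_lo = reward_net(*traj_lo).sum()
    ret_hi = reward_net(*traj_hi).sum()
    # Cross-entropy on the preference "hi is ranked above lo".
    return -torch.log(torch.sigmoid(ret_hi - ret_lo))
```

Under this kind of objective, the policy agent would optimize the learned reward with any standard RL algorithm, while the Stackelberg formulation determines which player acts as the leader and anticipates the follower's best response.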