Prior work has proposed a simple strategy for reinforcement learning (RL): label experience with the outcomes achieved in that experience, and then imitate the relabeled experience. These outcome-conditioned imitation learning methods are appealing because of their simplicity, strong performance, and close ties with supervised learning. However, it remains unclear how these methods relate to the standard RL objective, reward maximization. In this paper, we prove that existing outcome-conditioned imitation learning methods do not necessarily improve the policy; rather, in some settings they can decrease the expected reward. Nonetheless, we show that a simple modification results in a method that does guarantee policy improvement, under some assumptions. Our aim is not to develop an entirely new method, but rather to explain how a variant of outcome-conditioned imitation learning can be used to maximize rewards.
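To make the relabel-then-imitate recipe concrete, below is a minimal sketch (not the paper's implementation) of outcome-conditioned imitation learning with hindsight relabeling in a tabular setting: each trajectory is labeled with the outcome it actually reached, and a conditional policy is fit to the relabeled data by supervised learning (here, simple counting). The names `Transition`, `relabel_with_achieved_outcome`, and the toy dataset are illustrative assumptions, not from the original work.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Transition:
    state: int
    action: int
    next_state: int

def relabel_with_achieved_outcome(trajectory: List[Transition]) -> List[Tuple[int, int, int]]:
    """Label every step with the outcome (final state) the trajectory actually reached."""
    achieved_outcome = trajectory[-1].next_state
    return [(t.state, achieved_outcome, t.action) for t in trajectory]

def fit_conditional_policy(dataset: List[List[Transition]]):
    """Imitate the relabeled experience: estimate pi(a | s, outcome) by counting."""
    counts = defaultdict(lambda: defaultdict(int))
    for trajectory in dataset:
        for state, outcome, action in relabel_with_achieved_outcome(trajectory):
            counts[(state, outcome)][action] += 1
    policy = {}
    for key, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[key] = {a: c / total for a, c in action_counts.items()}
    return policy

# Toy example: two one-step trajectories from the same start state reaching different outcomes.
dataset = [
    [Transition(state=0, action=0, next_state=1)],
    [Transition(state=0, action=1, next_state=2)],
]
policy = fit_conditional_policy(dataset)
print(policy[(0, 1)])  # pi(a | s=0, desired outcome=1), learned from the relabeled data
```

The paper's point is that imitating such relabeled data, while simple, does not by itself guarantee an increase in expected reward; the sketch only illustrates the baseline procedure being analyzed.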