Hindsight goal relabeling has become a foundational technique in multi-goal reinforcement learning (RL). The essential idea is that any trajectory can be seen as a sub-optimal demonstration for reaching its final state. Intuitively, learning from those arbitrary demonstrations can be seen as a form of imitation learning (IL). However, the connection between hindsight goal relabeling and imitation learning is not well understood. In this paper, we propose a novel framework to understand hindsight goal relabeling from a divergence minimization perspective. Recasting the goal reaching problem in the IL framework not only allows us to derive several existing methods from first principles, but also provides us with the tools from IL to improve goal reaching algorithms. Experimentally, we find that under hindsight relabeling, Q-learning outperforms behavioral cloning (BC). Yet, a vanilla combination of both hurts performance. Concretely, we see that the BC loss only helps when selectively applied to actions that get the agent closer to the goal according to the Q-function. Our framework also explains the puzzling phenomenon wherein a reward of (-1, 0) results in significantly better performance than a (0, 1) reward for goal reaching.
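As a rough illustration of the relabeling idea described above, the following minimal sketch treats the final state of a trajectory as the goal that the trajectory was "trying" to reach, and assigns rewards under the (-1, 0) convention. The names `Transition` and `relabel_with_final_state`, the tuple-based state representation, and the exact-match success test are assumptions made for illustration; they are not taken from the paper, and the paper's own relabeling scheme may differ (for example, by sampling goals from other future states).

```python
from dataclasses import dataclass
from typing import List, Tuple

State = Tuple[int, ...]  # hypothetical discrete state representation


@dataclass(frozen=True)
class Transition:
    state: State
    action: int
    next_state: State
    goal: State
    reward: float


def relabel_with_final_state(
    trajectory: List[Tuple[State, int, State]],
    success_reward: float = 0.0,
    step_reward: float = -1.0,
) -> List[Transition]:
    """Relabel every step with the trajectory's final state as the goal.

    The defaults give the (-1, 0) reward: -1 on every step that has not
    reached the relabeled goal, 0 on a step that reaches it.
    """
    if not trajectory:
        return []
    achieved_goal = trajectory[-1][2]  # final next_state of the trajectory
    relabeled = []
    for state, action, next_state in trajectory:
        # Assumption: success means exact equality with the relabeled goal.
        reached = next_state == achieved_goal
        reward = success_reward if reached else step_reward
        relabeled.append(
            Transition(state, action, next_state, achieved_goal, reward)
        )
    return relabeled


# Toy 1-D trajectory: the agent moves right from position 0 to position 3.
trajectory = [((0,), 1, (1,)), ((1,), 1, (2,)), ((2,), 1, (3,))]

relabeled = relabel_with_final_state(trajectory)
rewards_minus_one_zero = [t.reward for t in relabeled]

# The same trajectory under the (0, 1) convention, for comparison.
rewards_zero_one = [
    t.reward
    for t in relabel_with_final_state(
        trajectory, success_reward=1.0, step_reward=0.0
    )
]
```

Both calls produce the same relabeled goals and differ only in the reward values, which is the comparison the last sentence of the abstract refers to. The sketch does not show why one convention performs better.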