Hindsight goal relabeling has become a foundational technique for multi-goal reinforcement learning (RL). The idea is quite simple: any trajectory can be seen as an expert demonstration for reaching the trajectory's end state. Intuitively, this procedure trains a goal-conditioned policy to imitate a sub-optimal expert. However, the connection between imitation and hindsight relabeling is not well understood. Modern imitation learning algorithms are described in the language of divergence minimization, yet it remains an open problem how to recast hindsight goal relabeling into that framework. In this work, we develop a unified objective for goal-reaching that explains this connection, from which we can derive both goal-conditioned supervised learning (GCSL) and the reward function in hindsight experience replay (HER) from first principles. Experimentally, we find that despite recent advances in goal-conditioned behaviour cloning (BC), multi-goal Q-learning can still outperform BC-like methods; moreover, a vanilla combination of the two actually hurts performance. Under our framework, we study when BC is expected to help, and empirically validate our findings. Our work further bridges goal-reaching and generative modeling, illustrating the nuances and new pathways of extending the success of generative models to RL.
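To make the core idea concrete, below is a minimal sketch of hindsight goal relabeling followed by GCSL-style supervised imitation. It is not the paper's implementation: the 1-D chain environment, the future-state goal sampling, and the toy linear policy are illustrative assumptions, standing in for the neural goal-conditioned policy used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def relabel_trajectory(states, actions):
    """Hindsight relabeling: for each step t, sample a future index t' > t from
    the same trajectory and treat (s_t, g = s_{t'}) -> a_t as an expert
    demonstration for reaching goal g."""
    relabeled = []
    T = len(actions)
    for t in range(T):
        t_goal = rng.integers(t + 1, T + 1)   # future time index in (t, T]
        goal = states[t_goal]                  # relabeled goal = achieved state
        relabeled.append((states[t], goal, actions[t]))
    return relabeled

# Toy rollout in a 1-D chain: the state is a position, the action a step in {-1, +1}.
states = [np.array([0.0])]
actions = []
for _ in range(10):
    a = rng.choice([-1.0, 1.0])
    actions.append(a)
    states.append(states[-1] + a)

data = relabel_trajectory(states, actions)

# GCSL-style behaviour cloning of a goal-conditioned policy pi(a | s, g);
# a linear policy with a squared-error surrogate loss, purely for illustration.
w = np.zeros(2)
for s, g, a in data:
    x = np.array([s[0], g[0]])                 # policy input: (state, goal)
    pred = np.tanh(w @ x)
    grad = (pred - a) * (1.0 - pred ** 2) * x  # gradient of 0.5 * (pred - a)^2
    w -= 0.1 * grad
print("fitted policy weights:", w)
```

The point of the sketch is the data-generation step: no reward signal is needed, because every relabeled tuple is, by construction, an optimal action for the goal it has been assigned; the supervised (BC) loss then imitates this self-generated "expert".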