Recent works have shown that sequence modeling can be effectively used to train reinforcement learning (RL) policies. However, the success of applying existing sequence models to planning, in which we wish to obtain a trajectory of actions to reach some goal, is less straightforward. The typical autoregressive generation procedures of sequence models preclude sequential refinement of earlier steps, which limits the effectiveness of a predicted plan. In this paper, we suggest an approach towards integrating planning with sequence models based on the idea of iterative energy minimization, and illustrate how such a procedure leads to improved RL performance across different tasks. We train a masked language model to capture an implicit energy function over trajectories of actions, and formulate planning as finding a trajectory of actions with minimum energy. We illustrate how this procedure enables improved performance over recent approaches across BabyAI and Atari environments. We further demonstrate unique benefits of our iterative optimization procedure, including generalization to new tasks, adaptation to test-time constraints, and the ability to compose plans together. Project website: https://hychen-naza.github.io/projects/LEAP
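The abstract describes planning as iterative energy minimization over action trajectories, with a masked language model implicitly defining the energy. Below is a minimal, hypothetical sketch of that idea, not the authors' implementation: a toy masked model over discrete action tokens, a pseudo-log-likelihood energy, and a refinement loop that resamples masked steps and keeps lower-energy trajectories. The model class, vocabulary, horizon, and greedy acceptance rule are illustrative assumptions (the paper's actual training and sampling procedure may differ; see the project website).

```python
# Minimal sketch (assumed, not the authors' code) of planning by iterative
# energy minimization with a masked model over action tokens.
import torch
import torch.nn as nn

NUM_ACTIONS, MASK_ID, HORIZON = 8, 8, 10  # action ids 0..7, id 8 = [MASK]

class MaskedTrajectoryModel(nn.Module):
    """Toy stand-in for a trained masked language model over action sequences."""
    def __init__(self, vocab=NUM_ACTIONS + 1, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Parameter(torch.zeros(HORIZON, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, NUM_ACTIONS)  # logits over real actions only

    def forward(self, tokens):                    # tokens: (B, HORIZON)
        h = self.encoder(self.embed(tokens) + self.pos)
        return self.head(h)                       # (B, HORIZON, NUM_ACTIONS)

@torch.no_grad()
def energy(model, traj):
    """Pseudo-log-likelihood energy: mask each step, sum its negative log-prob."""
    total = 0.0
    for t in range(HORIZON):
        masked = traj.clone()
        masked[:, t] = MASK_ID
        logp = model(masked).log_softmax(-1)
        total = total - logp[0, t, traj[0, t]].item()
    return total

@torch.no_grad()
def plan(model, steps=50):
    """Start from a random trajectory and iteratively resample masked steps,
    keeping proposals that lower the energy (greedy accept for simplicity)."""
    traj = torch.randint(NUM_ACTIONS, (1, HORIZON))
    best_e = energy(model, traj)
    for _ in range(steps):
        t = torch.randint(HORIZON, (1,)).item()   # pick one step to refine
        masked = traj.clone()
        masked[:, t] = MASK_ID
        probs = model(masked).softmax(-1)[0, t]
        proposal = traj.clone()
        proposal[0, t] = torch.multinomial(probs, 1).item()
        e = energy(model, proposal)
        if e < best_e:
            traj, best_e = proposal, e
    return traj, best_e

model = MaskedTrajectoryModel().eval()            # untrained, for illustration
plan_tokens, plan_energy = plan(model)
print(plan_tokens.tolist(), round(plan_energy, 3))
```

Because every step can be re-masked and resampled at any point in the loop, earlier actions keep being refined in light of later ones, which is exactly the property that autoregressive generation lacks.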