To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, whereas humans can maintain and draw on a lifetime of experience compressed into memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted away. However, existing approaches predominantly rely on either recurrent models with fixed-size memory or transformers that attend over the full context. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the model's inputs during training. We demonstrate Memo's effectiveness on a gridworld meta-RL benchmark and a multi-object navigation task in photo-realistic indoor environments. Memo outperforms naive long-context transformer baselines while being more compute- and storage-efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints.