The transformer architecture and its variants have achieved remarkable success across many machine learning tasks in recent years. This success is intrinsically related to their capability to handle long sequences and to the context-dependent weights produced by the attention mechanism. We argue that these capabilities suit the central role of a Meta-Reinforcement Learning (meta-RL) algorithm. Indeed, a meta-RL agent needs to infer the task from a sequence of trajectories. Furthermore, it requires a fast adaptation strategy to adjust its policy to a new task, which can be achieved using the self-attention mechanism. In this work, we present TrMRL (Transformers for Meta-Reinforcement Learning), a meta-RL agent that mimics the memory reinstatement mechanism using the transformer architecture. It associates recent working memories to build an episodic memory recursively through the transformer layers. We show that self-attention computes a consensus representation that minimizes the Bayes risk at each layer and provides meaningful features for computing the best actions. We conducted experiments in high-dimensional continuous control environments for locomotion and dexterous manipulation. The results show that TrMRL achieves comparable or superior asymptotic performance, sample efficiency, and out-of-distribution generalization relative to the baselines in these environments.
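To make the architectural idea concrete, the following is a minimal, hypothetical sketch of an agent in the spirit described above: a transformer encoder attends over a sequence of recent working memories (embedded transitions) and a policy head maps the resulting representation to an action. The class name, layer sizes, and transition encoding are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: transformer over recent transitions for a meta-RL policy.
import torch
import torch.nn as nn

class TransformerMetaPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, d_model=64, n_heads=4, n_layers=3):
        super().__init__()
        # Working memory: embed each (observation, action, reward) transition.
        self.embed = nn.Linear(obs_dim + act_dim + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Episodic-memory-like representation built through stacked transformer layers.
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.policy_head = nn.Linear(d_model, act_dim)

    def forward(self, obs, acts, rews):
        # obs: (B, T, obs_dim), acts: (B, T, act_dim), rews: (B, T, 1)
        tokens = self.embed(torch.cat([obs, acts, rews], dim=-1))
        memory = self.encoder(tokens)           # self-attention over the recent past
        return self.policy_head(memory[:, -1])  # act from the latest representation

# Usage: infer the task context from a short trajectory of 5 transitions.
policy = TransformerMetaPolicy(obs_dim=8, act_dim=2)
action = policy(torch.randn(1, 5, 8), torch.randn(1, 5, 2), torch.randn(1, 5, 1))
```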