Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains.
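To make ingredient (3) concrete, below is a minimal sketch of a frontier-prioritising curriculum, assuming a simple setup where per-task success rates are tracked with an exponential moving average and tasks are sampled in proportion to how close their estimated success rate is to an intermediate target. This is an illustrative assumption, not the paper's actual mechanism; all names (`FrontierCurriculum`, `priority`, etc.) are hypothetical.

```python
import random
from collections import defaultdict


class FrontierCurriculum:
    """Hypothetical sketch: prefer tasks whose estimated success rate is
    neither ~0 (currently too hard) nor ~1 (already mastered)."""

    def __init__(self, task_ids, target=0.5, smoothing=0.1):
        self.task_ids = list(task_ids)
        self.target = target        # success rate treated as the "frontier"
        self.smoothing = smoothing  # EMA step size for success estimates
        self.success = defaultdict(float)  # per-task success estimate

    def priority(self, task_id):
        # Peaks when the estimated success rate is near the target;
        # a small floor keeps unexplored tasks sampleable.
        distance = abs(self.success[task_id] - self.target) / self.target
        return max(1e-3, 1.0 - distance)

    def sample(self):
        # Sample a task in proportion to its frontier priority.
        weights = [self.priority(t) for t in self.task_ids]
        return random.choices(self.task_ids, weights=weights, k=1)[0]

    def update(self, task_id, succeeded):
        # Exponential moving average of recent episode outcomes.
        s = self.success[task_id]
        self.success[task_id] = (1 - self.smoothing) * s + self.smoothing * float(succeeded)
```

In this toy version, tasks the agent always fails or always solves receive near-zero weight, so training compute concentrates on tasks at the edge of the agent's current capabilities, which is the intuition behind prioritising the frontier.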