In this work we create agents that can perform well beyond a single, individual task, that exhibit much wider generalisation of behaviour to a massive, rich space of challenges. We define a universe of tasks within an environment domain and demonstrate the ability to train agents that are generally capable across this vast space and beyond. The environment is natively multi-agent, spanning the continuum of competitive, cooperative, and independent games, which are situated within procedurally generated physical 3D worlds. The resulting space is exceptionally diverse in terms of the challenges posed to agents, and as such, even measuring the learning progress of an agent is an open research problem. We propose an iterative notion of improvement between successive generations of agents, rather than seeking to maximise a singular objective, allowing us to quantify progress despite tasks being incomparable in terms of achievable rewards. We show that through constructing an open-ended learning process, which dynamically changes the training task distributions and training objectives such that the agent never stops learning, we achieve consistent learning of new behaviours. The resulting agent is able to score reward in every one of our humanly solvable evaluation levels, with behaviour generalising to many held-out points in the universe of tasks. Examples of this zero-shot generalisation include good performance on Hide and Seek, Capture the Flag, and Tag. Through analysis and hand-authored probe tasks we characterise the behaviour of our agent, and find interesting emergent heuristic behaviours such as trial-and-error experimentation, simple tool use, option switching, and cooperation. Finally, we demonstrate that the general capabilities of this agent could unlock larger scale transfer of behaviour through cheap finetuning.
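As an illustration of how progress between generations might be quantified when rewards are incomparable across tasks, the following is a minimal sketch and not the method used in this work. It assumes each per-task score has already been normalised to [0, 1] against some per-task reference, and it treats one generation as an improvement on another only if it is no worse at every chosen percentile of the score distribution and strictly better at one or more. The function names and the percentile choices (10, 20, 50) are hypothetical.

```python
from typing import List, Sequence


def percentiles(scores: Sequence[float], qs: Sequence[int] = (10, 20, 50)) -> List[float]:
    """Nearest-rank percentiles of normalised per-task scores.

    Assumption: every score is already normalised to [0, 1] per task,
    so values from different tasks can be pooled into one distribution.
    """
    if not scores:
        raise ValueError("need at least one task score")
    ordered = sorted(scores)
    n = len(ordered)
    result = []
    for q in qs:
        # Integer ceiling of q * n / 100, clamped to a valid 1-based rank.
        rank = max(1, -(-q * n // 100))
        result.append(ordered[rank - 1])
    return result


def improves_on(new_scores: Sequence[float],
                old_scores: Sequence[float],
                qs: Sequence[int] = (10, 20, 50)) -> bool:
    """True if the new generation dominates the old one across percentiles.

    Dominance here means: no percentile is lower, and at least one is higher.
    Two generations can be mutually non-dominating (neither improves).
    """
    new_p = percentiles(new_scores, qs)
    old_p = percentiles(old_scores, qs)
    no_worse = all(a >= b for a, b in zip(new_p, old_p))
    strictly_better = any(a > b for a, b in zip(new_p, old_p))
    return no_worse and strictly_better


if __name__ == "__main__":
    generation_1 = [0.0, 0.1, 0.2, 0.5, 0.9]  # hypothetical normalised scores
    generation_2 = [0.1, 0.2, 0.3, 0.5, 0.9]
    print(improves_on(generation_2, generation_1))
```

Comparing low percentiles rather than a mean rewards agents that score on more tasks, and the comparison stays meaningful without collapsing all tasks into a single scalar objective.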