Despite impressive successes, deep reinforcement learning (RL) systems still fall short of human performance on generalization to new tasks and environments that differ from their training. As a benchmark tailored for studying RL generalization, we introduce Avalon, a set of tasks in which embodied agents in highly diverse procedural 3D worlds must survive by navigating terrain, hunting or gathering food, and avoiding hazards. Avalon is unique among existing RL benchmarks in that the reward function, world dynamics, and action space are the same for every task, with tasks differentiated solely by altering the environment; its 20 tasks, ranging in complexity from eat and throw to hunt and navigate, each create worlds in which the agent must perform specific skills in order to survive. This setup enables investigations of generalization within tasks, between tasks, and to compositional tasks that require combining skills learned from previous tasks. Avalon includes a highly efficient simulator, a library of baselines, and a benchmark with scoring metrics evaluated against hundreds of hours of human performance, all of which are open-source and publicly available. We find that standard RL baselines make progress on most tasks but are still far from human performance, suggesting Avalon is challenging enough to advance the quest for generalizable RL.
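To make the design concrete, the following toy Python sketch illustrates the benchmark's central property: a single agent-facing interface (action space, dynamics, reward) shared by every task, with the task label affecting only how the world is generated. All names here (ToyAvalonEnv, its actions, and its reward) are illustrative assumptions, not Avalon's actual open-source API.

```python
import random

TASKS = ["eat", "throw", "hunt", "navigate"]  # 4 of Avalon's 20 task types

class ToyAvalonEnv:
    """Hypothetical gym-style stand-in for an Avalon task (not the real API)."""

    ACTIONS = ["move", "look", "grab", "release"]  # one action space for all tasks

    def __init__(self, task: str, seed: int = 0):
        assert task in TASKS
        self.task = task
        self.rng = random.Random(seed)
        self.energy = 1.0

    def reset(self):
        # Only world generation depends on the task; the agent-facing
        # interface (actions, dynamics, reward) is identical everywhere.
        self.energy = 1.0
        return {"task": self.task, "world_seed": self.rng.random()}

    def step(self, action: str):
        assert action in self.ACTIONS
        self.energy -= 0.1  # shared dynamics: energy drains every step
        if action == "grab" and self.rng.random() < 0.3:
            self.energy = min(1.0, self.energy + 0.5)  # e.g., found food
        done = self.energy <= 0.0
        reward = self.energy  # shared survival-style reward signal
        return {"task": self.task}, reward, done, {}

# Random-policy rollout on one task; swapping task="hunt" for task="eat"
# changes nothing about how the agent interacts with the environment.
env = ToyAvalonEnv(task="hunt", seed=42)
obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(random.choice(ToyAvalonEnv.ACTIONS))
```

Because nothing about the interface changes across tasks, a policy trained on one set of worlds can be evaluated on another without modification, which is what enables the within-task, between-task, and compositional generalization studies described above.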