The MineRL 2019 Competition on Sample Efficient Reinforcement Learning using Human Priors

William H. Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noboru Kuno, Stephanie Milani, Sharada Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, Manuela Veloso, Phillip Wang

From arXiv, accepted at NeurIPS 2019, 28 pages

Though deep reinforcement learning has led to breakthroughs in many difficult domains, these successes have required an ever-increasing number of samples. As state-of-the-art reinforcement learning (RL) systems require an exponentially increasing number of samples, their development is restricted to a continually shrinking segment of the AI community. Likewise, many of these systems cannot be applied to real-world problems, where environment samples are expensive. Resolving these limitations requires new, sample-efficient methods. To facilitate research in this direction, we introduce the MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors. The primary goal of the competition is to foster the development of algorithms that can efficiently leverage human demonstrations to drastically reduce the number of samples needed to solve complex, hierarchical, and sparse environments. To that end, we introduce: (1) the Minecraft ObtainDiamond task, a sequential decision-making environment requiring long-term planning, hierarchical control, and efficient exploration methods; and (2) the MineRL-v0 dataset, a large-scale collection of over 60 million state-action pairs of human demonstrations that can be resimulated into embodied trajectories with arbitrary modifications to game state and visuals. Participants will compete to develop systems that solve the ObtainDiamond task with a limited number of samples from the environment simulator, Malmo. The competition is structured into two rounds, in which competitors are provided with several paired versions of the dataset and environment with different game textures. At the end of each round, competitors will submit containerized versions of their learning algorithms, which will then be trained and evaluated from scratch on a hold-out dataset-environment pair for a total of 4 days on a prespecified hardware platform.
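
The constraint of a "limited number of samples from the environment simulator" can be pictured as a wrapper that counts every environment step and refuses further interaction once a budget is spent. The sketch below is only an illustration under stated assumptions: `BudgetedEnv`, `SampleBudgetExceeded`, `ToyEnv`, and `run_until_budget` are hypothetical names, the toy environment stands in for the real simulator, and the budget of 12 is arbitrary. It is not the competition's actual evaluation code, and the abstract does not state the real sample limit.

```python
class SampleBudgetExceeded(RuntimeError):
    """Raised when the learner asks for more environment samples than allowed."""


class BudgetedEnv:
    """Hypothetical wrapper: counts step() calls on any env with reset()/step()."""

    def __init__(self, env, max_samples):
        self.env = env
        self.max_samples = max_samples
        self.samples_used = 0

    def reset(self):
        # Assumption of this sketch: resets are free, only step() consumes budget.
        return self.env.reset()

    def step(self, action):
        if self.samples_used >= self.max_samples:
            raise SampleBudgetExceeded(
                f"budget of {self.max_samples} samples exhausted"
            )
        self.samples_used += 1
        return self.env.step(action)


class ToyEnv:
    """Stand-in environment (hypothetical): 5-step episodes, reward on step 3."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if self.t % 3 == 0 else 0.0
        done = self.t >= 5
        return self.t, reward, done, {}


def run_until_budget(env, policy):
    """Roll out episodes until the wrapper refuses further samples."""
    total_reward = 0.0
    try:
        while True:
            obs = env.reset()
            done = False
            while not done:
                obs, reward, done, _ = env.step(policy(obs))
                total_reward += reward
    except SampleBudgetExceeded:
        pass  # Budget spent: stop interacting with the simulator.
    return total_reward


if __name__ == "__main__":
    env = BudgetedEnv(ToyEnv(), max_samples=12)
    total = run_until_budget(env, policy=lambda obs: 0)
    print(env.samples_used, total)
```

Under a hard cap like this, any learning signal beyond the budget has to come from somewhere other than the simulator, which is the role the abstract assigns to the human demonstrations.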
