The MineRL BASALT Competition on Learning from Human Feedback

Rohin Shah, Cody Wild, Steven H. Wang, Neel Alex, Brandon Houghton, William Guss, Sharada Mohanty, Anssi Kanervisto, Stephanie Milani, Nicholay Topin, Pieter Abbeel, Stuart Russell, Anca Dragan

From arXiv, NeurIPS 2021 Competition Track

The last decade has seen a significant increase of interest in deep learning research, with many public successes that have demonstrated its potential. As such, these systems are now being incorporated into commercial products. With this comes an additional challenge: how can we build AI systems that solve tasks where there is not a crisp, well-defined specification? While multiple solutions have been proposed, in this competition we focus on one in particular: learning from human feedback. Rather than training AI systems using a predefined reward function or using a labeled dataset with a predefined set of categories, we instead train the AI system using a learning signal derived from some form of human feedback, which can evolve over time as the understanding of the task changes, or as the capabilities of the AI system improve. The MineRL BASALT competition aims to spur forward research on this important class of techniques. We design a suite of four tasks in Minecraft for which we expect it will be hard to write down hardcoded reward functions. These tasks are defined by a paragraph of natural language: for example, "create a waterfall and take a scenic picture of it", with additional clarifying details. Participants must train a separate agent for each task, using any method they want. Agents are then evaluated by humans who have read the task description. To help participants get started, we provide a dataset of human demonstrations on each of the four tasks, as well as an imitation learning baseline that leverages these demonstrations. Our hope is that this competition will improve our ability to build AI systems that do what their designers intend them to do, even when the intent cannot be easily formalized. Besides allowing AI to solve more tasks, this can also enable more effective regulation of AI systems, as well as making progress on the value alignment problem.
