An important goal in artificial intelligence is to create agents that can both interact naturally with humans and learn from their feedback. Here we demonstrate how to use reinforcement learning from human feedback (RLHF) to improve upon simulated, embodied agents trained to a base level of competency with imitation learning. First, we collected data from humans interacting with agents in a simulated 3D world. We then asked annotators to record moments where they believed that agents either progressed toward or regressed from their human-instructed goal. Using this annotation data, we applied a novel method, which we call "Inter-temporal Bradley-Terry" (IBT) modelling, to build a reward model that captures human judgments. Agents trained to optimise rewards delivered by IBT reward models improved with respect to all of our metrics, including subsequent human judgment during live interactions with agents. Altogether, our results demonstrate how one can successfully leverage human judgments to improve agent behaviour, allowing us to use reinforcement learning in complex, embodied domains without programmatic reward functions. Videos of agent behaviour may be found at https://youtu.be/v_Z9F2_eKk4.
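The abstract does not spell out how IBT modelling works internally, so the following is only a minimal sketch of the general idea it builds on: a standard Bradley-Terry preference model applied to pairs of moments within one episode. It assumes that a "progress" annotation means the later moment should score higher than the earlier one, and a "regress" annotation the reverse. The function names, the annotation tuple layout, and the use of plain per-timestep scores in place of a learned network are all hypothetical choices made for illustration.

```python
import math


def bt_preference_prob(r_preferred: float, r_other: float) -> float:
    """Bradley-Terry probability that the first item is preferred.

    Equivalent to sigmoid(r_preferred - r_other).
    """
    return 1.0 / (1.0 + math.exp(-(r_preferred - r_other)))


def pairwise_loss(rewards, annotations):
    """Mean negative log-likelihood over annotated pairs of timesteps.

    rewards:     per-timestep scores from a (hypothetical) reward model.
    annotations: list of (t_earlier, t_later, label) tuples, where
                 label = +1 marks progress and label = -1 marks regress.
                 This layout is an assumption, not the paper's format.
    """
    total = 0.0
    for t_earlier, t_later, label in annotations:
        if label > 0:
            # Progress: the later moment should be preferred.
            p = bt_preference_prob(rewards[t_later], rewards[t_earlier])
        else:
            # Regress: the earlier moment should be preferred.
            p = bt_preference_prob(rewards[t_earlier], rewards[t_later])
        total -= math.log(p)
    return total / len(annotations)


if __name__ == "__main__":
    # Toy episode: scores rise until step 2, then drop at step 3.
    scores = [0.0, 0.5, 1.5, 1.0]
    labels = [(0, 2, +1), (2, 3, -1)]
    print(pairwise_loss(scores, labels))
```

Minimising this loss pushes the scores of preferred moments above the scores of the moments they are compared with; in practice the scores would come from a trainable model and the loss would be minimised by gradient descent.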