High sample complexity has long been a challenge for RL. Humans, on the other hand, learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies, or of task-specific environmental dynamics and reward structures. We therefore hypothesize that the ability to utilize human-written instruction manuals to assist in learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework, which speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual, and a Reasoning module that evaluates object-agent interactions based on information from the manual. An auxiliary reward is then provided to a standard A2C RL agent when an interaction is detected. When assisted by our design, A2C improves on 4 sparse-reward games in the Atari environment, and requires 1000x fewer training frames than the previous SOTA, Agent 57, on Skiing, the hardest game in Atari.
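To make the auxiliary-reward mechanism concrete, the following is a minimal Python sketch of how manual-derived hints could shape the sparse environment reward seen by an A2C agent. It assumes a standard Gym-style `step` interface; the names `hints` (output of the QA Extraction module) and `interaction_score` (a stand-in for the Reasoning module's verdict) are illustrative and not the authors' actual API.

```python
class ReadAndRewardWrapper:
    """Illustrative environment wrapper: adds an auxiliary reward whenever a
    (hypothetical) reasoning function flags an agent-object interaction that
    the manual describes as beneficial or harmful."""

    def __init__(self, env, hints, interaction_score, bonus=1.0):
        self.env = env
        self.hints = hints                        # facts summarized from the manual
        self.interaction_score = interaction_score  # obs, hints -> +1 / -1 / 0
        self.bonus = bonus                        # scale of the shaping signal

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Shape the (often sparse) game reward with a manual-grounded bonus.
        aux = self.bonus * self.interaction_score(obs, self.hints)
        return obs, reward + aux, done, info
```

Under this sketch, the underlying RL algorithm (e.g., A2C) is unchanged; it simply trains on the shaped reward `reward + aux` instead of the raw game score.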