Learning to flexibly follow task instructions in dynamic environments poses interesting challenges for reinforcement learning agents. We focus here on the problem of learning control flow that deviates from a strict step-by-step execution of instructions -- that is, control flow that may skip forward over parts of the instructions or return backward to previously completed or skipped steps. Demand for such flexible control arises in two fundamental ways: explicitly when control is specified in the instructions themselves (such as conditional branching and looping) and implicitly when stochastic environment dynamics require re-completion of instructions whose effects have been perturbed, or opportunistic skipping of instructions whose effects are already present. We formulate an attention-based architecture that meets these challenges by learning, from task reward only, to flexibly attend to and condition behavior on an internal encoding of the instructions. We test the architecture's ability to learn both explicit and implicit control in two illustrative domains -- one inspired by Minecraft and the other by StarCraft -- and show that the architecture exhibits zero-shot generalization to novel instructions longer than those in the training set, at a performance level unmatched by two baseline recurrent architectures and one ablation architecture.
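To make the core mechanism concrete, the sketch below shows one way a policy can attend over an internal encoding of the instruction sequence and condition its action distribution on the attended summary, with the attention trained only through the policy-gradient signal from task reward. This is a minimal illustrative sketch, not the authors' implementation: the module names, the GRU instruction encoder, the single state-driven attention query, and all dimensions are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InstructionAttentionPolicy(nn.Module):
    """Illustrative policy that attends over an encoded instruction list."""

    def __init__(self, obs_dim, instr_vocab, embed_dim, hidden_dim, num_actions):
        super().__init__()
        self.instr_embed = nn.Embedding(instr_vocab, embed_dim)
        self.instr_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.obs_encoder = nn.Linear(obs_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, hidden_dim)   # attention query from current state
        self.policy_head = nn.Linear(2 * hidden_dim, num_actions)

    def forward(self, obs, instr_tokens):
        # Encode the instruction tokens into per-step vectors: (B, L, H).
        instr_enc, _ = self.instr_encoder(self.instr_embed(instr_tokens))
        state = self.obs_encoder(obs)                                        # (B, H)
        # Soft attention over instruction steps, driven by the current state,
        # so the agent can focus on, skip over, or return to any step.
        scores = torch.bmm(instr_enc, self.query(state).unsqueeze(-1)).squeeze(-1)  # (B, L)
        weights = F.softmax(scores, dim=-1)
        attended = torch.bmm(weights.unsqueeze(1), instr_enc).squeeze(1)     # (B, H)
        # Condition the action distribution on both the state and the attended step.
        logits = self.policy_head(torch.cat([state, attended], dim=-1))
        return torch.distributions.Categorical(logits=logits)
```

Because the attention weights are produced by differentiable soft attention inside the policy, they receive gradients from whatever reinforcement learning objective trains the action logits; no step-level supervision of which instruction to attend to is assumed.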