Reasoning in a complex and ambiguous environment is a key goal for Reinforcement Learning (RL) agents. While some sophisticated RL agents can successfully solve difficult tasks, they require a large amount of training data and often struggle to generalize to new, unseen environments and new tasks. On the other hand, Large Scale Language Models (LSLMs) have exhibited strong reasoning ability and the ability to adapt to new tasks through in-context learning. However, LSLMs do not inherently have the ability to interrogate or intervene on the environment. In this work, we investigate how to combine these complementary abilities in a single system consisting of three parts: a Planner, an Actor, and a Reporter. The Planner is a pre-trained language model that can issue commands to a simple embodied agent (the Actor), while the Reporter communicates with the Planner to inform its next command. We present a set of tasks that require reasoning, test this system's ability to generalize zero-shot, investigate failure cases, and demonstrate how components of this system can be trained with reinforcement learning to improve performance.
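To make the Planner-Actor-Reporter structure concrete, here is a minimal sketch of the interaction loop described above. All class names, method signatures, and the environment interface are hypothetical stand-ins for illustration only, not the paper's actual implementation.

```python
# Hypothetical sketch of the Planner-Actor-Reporter loop.
# Interfaces below are assumptions, not the paper's real API.

class Planner:
    """Pre-trained language model that issues natural-language commands."""
    def next_command(self, dialogue: list[str]) -> str:
        # In the real system this would prompt an LSLM with the dialogue so far.
        raise NotImplementedError


class Actor:
    """Simple embodied agent that executes a command in the environment."""
    def act(self, command: str, env) -> None:
        raise NotImplementedError


class Reporter:
    """Translates environment observations into text for the Planner."""
    def report(self, env) -> str:
        raise NotImplementedError


def run_episode(planner: Planner, actor: Actor, reporter: Reporter,
                env, max_steps: int = 10) -> list[str]:
    """Run one Planner-Actor-Reporter interaction loop."""
    dialogue = [reporter.report(env)]              # initial observation as text
    for _ in range(max_steps):
        command = planner.next_command(dialogue)   # Planner issues a command
        dialogue.append(command)
        if command.strip().lower() == "done":      # Planner decides the task is finished
            break
        actor.act(command, env)                    # Actor executes the command
        dialogue.append(reporter.report(env))      # Reporter feeds the outcome back
    return dialogue
```

The key design point the sketch illustrates is that the Planner never touches the environment directly: it only sees text produced by the Reporter, and only influences the world through commands executed by the Actor.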