Language model (LM) pre-training is useful for many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach that uses LMs to scaffold learning and generalization in general sequential decision-making problems. In this approach, goals and observations are represented as a sequence of embeddings, and a policy network initialized with a pre-trained LM predicts the next action. We demonstrate that this framework enables effective combinatorial generalization across different environments and supervisory modalities. We begin by assuming access to a set of expert demonstrations, and show that initializing policies with LMs and fine-tuning them via behavior cloning improves task completion rates by 43.6% in the VirtualHome environment. We then examine how our framework can be used in environments without pre-collected expert data. To do this, we integrate an active data gathering procedure into the pre-trained LM policy: the agent iteratively learns by interacting with the environment, relabeling the language goals of past 'failed' experiences, and updating its policy in a self-supervised loop. This active data gathering procedure also enables effective combinatorial generalization, outperforming the best baseline by 25.1%. Finally, we explain these results by investigating three possible factors underlying the effectiveness of the LM-based policy. We find that sequential input representations (as opposed to fixed-dimensional feature vectors) and favorable weight initialization are both important for generalization. Surprisingly, however, the format of the policy's input encoding (e.g., as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans.
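As a concrete illustration of the policy architecture described above, the sketch below projects goal and observation features into the embedding space of a pre-trained LM, runs them through the LM, and predicts the next action, trained with a behavior-cloning cross-entropy loss. This is a minimal sketch under stated assumptions, not the paper's exact implementation: the GPT-2 backbone, the linear projection layers, the feature dimensions, and the discrete action head are all illustrative choices.

```python
# Minimal sketch of an LM-initialized policy (illustrative; assumes a GPT-2
# backbone, fixed-size goal/observation features, and a discrete action space).
import torch
import torch.nn as nn
from transformers import GPT2Model


class LMPolicy(nn.Module):
    def __init__(self, obs_dim, goal_dim, num_actions, lm_name="gpt2"):
        super().__init__()
        # Pre-trained LM weights serve as the policy initialization.
        self.lm = GPT2Model.from_pretrained(lm_name)
        hidden = self.lm.config.n_embd  # 768 for "gpt2"
        # Project goal and observation features into the LM's embedding space.
        self.goal_proj = nn.Linear(goal_dim, hidden)
        self.obs_proj = nn.Linear(obs_dim, hidden)
        # Predict the next action from the final contextualized embedding.
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, goal_feats, obs_feats):
        # goal_feats: (B, G, goal_dim); obs_feats: (B, T, obs_dim)
        tokens = torch.cat(
            [self.goal_proj(goal_feats), self.obs_proj(obs_feats)], dim=1
        )  # (B, G + T, hidden): goals and observations as a sequence of embeddings
        h = self.lm(inputs_embeds=tokens).last_hidden_state
        return self.action_head(h[:, -1])  # logits over the next action


# Behavior cloning: cross-entropy against the expert's next action.
policy = LMPolicy(obs_dim=128, goal_dim=64, num_actions=10)
goal = torch.randn(2, 3, 64)          # dummy goal embeddings
obs = torch.randn(2, 5, 128)          # dummy observation history
expert_action = torch.tensor([1, 4])  # dummy expert labels
loss = nn.CrossEntropyLoss()(policy(goal, obs), expert_action)
loss.backward()
```

In the active data gathering setting, the same update would be applied to trajectories whose language goals have been relabeled in hindsight, rather than to pre-collected expert demonstrations.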