Language model (LM) pre-training is useful in many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach for using LMs to scaffold learning and generalization in general sequential decision-making problems. In this approach, goals and observations are represented as a sequence of embeddings, and a policy network initialized with a pre-trained LM predicts the next action. We demonstrate that this framework enables effective combinatorial generalization across different environments and supervisory modalities. We begin by assuming access to a set of expert demonstrations, and show that initializing policies with LMs and fine-tuning them via behavior cloning improves task completion rates by 43.6% in the VirtualHome environment. Next, we integrate an active data gathering procedure in which agents iteratively interact with the environment, relabel past "failed" experiences with new goals, and update their policies in a self-supervised loop. Active data gathering further improves combinatorial generalization, outperforming the best baseline by 25.1%. Finally, we explain these results by investigating three possible factors underlying the effectiveness of the LM-based policy. We find that sequential input representations (vs. fixed-dimensional feature vectors) and LM-based weight initialization are both important for generalization. Surprisingly, however, the format of the policy input encoding (e.g., as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.
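To make the policy architecture concrete, the following is a minimal sketch (not the authors' implementation) of the LM-as-policy idea, assuming a GPT-2 backbone from HuggingFace and illustrative dimensions: the goal and each observation are projected to embedding vectors, the resulting sequence is passed through the pre-trained transformer, and a linear head scores the next action.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

class LMPolicy(nn.Module):
    """Policy that encodes (goal, observation history) as a sequence of embeddings
    and predicts the next action with a pre-trained LM backbone."""
    def __init__(self, obs_dim=128, goal_dim=128, n_actions=50):
        super().__init__()
        self.lm = GPT2Model.from_pretrained("gpt2")      # pre-trained LM weights
        hidden = self.lm.config.n_embd                   # 768 for gpt2
        self.goal_proj = nn.Linear(goal_dim, hidden)     # goal -> one "token" embedding
        self.obs_proj = nn.Linear(obs_dim, hidden)       # each observation -> one "token" embedding
        self.action_head = nn.Linear(hidden, n_actions)  # scores for the next action

    def forward(self, goal, obs_seq):
        # goal: (B, goal_dim); obs_seq: (B, T, obs_dim)
        tokens = torch.cat([self.goal_proj(goal).unsqueeze(1),
                            self.obs_proj(obs_seq)], dim=1)        # (B, 1 + T, hidden)
        out = self.lm(inputs_embeds=tokens).last_hidden_state      # contextualized sequence
        return self.action_head(out[:, -1])                        # logits for the next action

policy = LMPolicy()
logits = policy(torch.randn(2, 128), torch.randn(2, 5, 128))       # (2, n_actions)
```

With expert demonstrations, such a policy can be fine-tuned by standard behavior cloning, i.e. cross-entropy between the predicted action logits and the demonstrated actions.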
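The active data gathering step hinges on relabeling "failed" episodes with the goals they actually achieved, so every interaction yields usable supervision. Below is a minimal, self-contained sketch of that relabeling loop; the `Episode` record and the toy `rollout` are hypothetical stand-ins, not the paper's environment or policy interface.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Episode:
    goal: str                                   # goal the agent was asked to achieve
    actions: list = field(default_factory=list)
    achieved: str = ""                          # goal actually achieved by the trajectory

    @property
    def success(self):
        return self.goal == self.achieved

def rollout(goal):
    # Toy stand-in for interacting with the environment using the current policy.
    ep = Episode(goal=goal, actions=[f"step_{i}" for i in range(3)])
    ep.achieved = random.choice(["make coffee", "set table", "wash mug"])
    return ep

def gather(goals, n_rounds=2):
    dataset = []
    for _ in range(n_rounds):
        for g in goals:
            ep = rollout(g)
            if not ep.success:
                ep.goal = ep.achieved           # hindsight relabeling of a "failed" episode
            dataset.append(ep)
        # In the full loop, a behavior-cloning update on `dataset` would follow here.
    return dataset

data = gather(["make coffee", "set table"])
print(sum(ep.success for ep in data), "of", len(data), "episodes usable after relabeling")
```

After relabeling, every episode is consistent with some goal, which is what allows the policy to keep improving from its own interactions without additional expert data.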