Fully-parametric language models generally require a huge number of model parameters to store the knowledge needed for solving multiple natural language tasks in zero/few-shot settings. In addition, it is hard for them to adapt to evolving world knowledge without costly model re-training. In this paper, we develop a novel semi-parametric language model architecture, Knowledge-in-Context (KiC), which empowers a parametric text-to-text language model with a knowledge-rich external memory. Specifically, the external memory contains six different types of knowledge: entity, dictionary, commonsense, event, script, and causality knowledge. For each input instance, the KiC model adaptively selects a knowledge type and retrieves the most helpful pieces of knowledge. The input instance, along with its knowledge augmentation, is fed into a text-to-text model (e.g., T5) to generate the output answer, where both the input and the output are in natural-language form after prompting. Interestingly, we find that KiC can be identified as a special mixture-of-experts (MoE) model, where the knowledge selector plays the role of the router that determines the sequence-to-expert assignment. This key observation inspires us to develop a novel algorithm for training KiC with an instance-adaptive knowledge selector. As a knowledge-rich semi-parametric language model, KiC needs only a much smaller parametric part to achieve superior zero-shot performance on unseen tasks. Evaluated on 40+ different tasks, KiC-Large with 770M parameters outperforms large language models (LMs) that are 4-39x larger by a large margin. We also show that KiC exhibits emergent abilities at a much smaller model scale compared to fully-parametric models.
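To make the described pipeline concrete, the following is a minimal sketch of the KiC inference flow, assuming a Hugging Face-style T5 interface. The `select_knowledge_type` and `retrieve` functions and the prompt format are hypothetical placeholders for illustration, not the paper's actual learned selector, retriever, or prompting scheme.

```python
# Minimal sketch of the KiC inference flow (assumptions noted above):
# route each instance to one of six knowledge memories, retrieve helpful
# passages, prepend them to the input, and decode with a text-to-text LM.
from transformers import AutoTokenizer, T5ForConditionalGeneration

KNOWLEDGE_TYPES = ["entity", "dictionary", "commonsense",
                   "event", "script", "causality"]

tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

def select_knowledge_type(instance: str) -> str:
    """Instance-adaptive knowledge selector (the MoE-style 'router').
    Stubbed here; in KiC this selector is trained, not hard-coded."""
    return "commonsense"  # placeholder routing decision

def retrieve(knowledge_type: str, instance: str, k: int = 3) -> list:
    """Fetch the top-k passages from the chosen external memory.
    Stubbed here; a real system would query the knowledge source."""
    return [f"({knowledge_type}) retrieved passage {i}" for i in range(k)]

def kic_answer(instance: str) -> str:
    ktype = select_knowledge_type(instance)       # route to one expert/memory
    passages = retrieve(ktype, instance)          # retrieve helpful knowledge
    # Knowledge-in-context: the retrieved text is simply prepended to the
    # prompted input, so both input and output stay in natural language.
    prompt = " ".join(passages) + " question: " + instance
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(kic_answer("Where do penguins live?"))
```

Viewing each knowledge memory as an expert makes the MoE analogy direct: the selector assigns a whole input sequence to one expert, which is what motivates training the selector with MoE-style routing techniques.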