Fully-parametric language models generally require a huge number of model parameters to store the knowledge needed to solve multiple natural language tasks in zero/few-shot settings. In addition, such models are hard to adapt to evolving world knowledge without costly re-training. In this paper, we develop a novel semi-parametric language model architecture, Knowledge-in-Context (KiC), which empowers a parametric text-to-text language model with a knowledge-rich external memory. Specifically, the external memory contains six different types of knowledge: entity, dictionary, commonsense, event, script, and causality knowledge. For each input instance, the KiC model adaptively selects a knowledge type and retrieves the most helpful pieces of knowledge. The input instance, together with its knowledge augmentation, is fed into a text-to-text model (e.g., T5) to generate the output answer, where both the input and the output are in natural language form after prompting. Interestingly, we find that KiC can be viewed as a special mixture-of-experts (MoE) model, where the knowledge selector plays the role of the router that determines the sequence-to-expert assignment in MoE. This key observation inspires us to develop a novel algorithm for training KiC with an instance-adaptive knowledge selector. As a knowledge-rich semi-parametric language model, KiC needs only a much smaller parametric part to achieve superior zero-shot performance on unseen tasks. Evaluating on 40+ different tasks, we show that KiC_Large, with 770M parameters, outperforms large language models (LMs) that are 4-39x larger by a large margin. We also demonstrate that KiC exhibits emergent abilities at a much smaller model scale than fully-parametric models.
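To make the described inference flow concrete, the following is a minimal sketch of one KiC step, assuming a T5 backbone from the `transformers` library; the `select_knowledge_type` and `retrieve` functions and the prompt format are illustrative placeholders (in the paper, the selector is learned jointly and retrieval operates over the external memory), not the authors' actual implementation.

```python
# Minimal sketch of a KiC-style inference step: pick a knowledge type,
# retrieve helpful snippets, augment the prompted input, and generate.
from transformers import T5ForConditionalGeneration, T5Tokenizer

KNOWLEDGE_TYPES = ["entity", "dictionary", "commonsense", "event", "script", "causality"]

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

def select_knowledge_type(prompt: str) -> str:
    """Instance-adaptive selector: acts like an MoE router, choosing the
    knowledge type (expert) expected to help most for this input.
    Placeholder: the paper trains this selector rather than hard-coding it."""
    return "commonsense"  # illustrative fixed choice

def retrieve(knowledge_type: str, prompt: str, k: int = 1) -> list[str]:
    """Placeholder retriever over the external memory for one knowledge type."""
    return ["birds can fly"]  # illustrative retrieved snippet

def kic_generate(prompt: str) -> str:
    ktype = select_knowledge_type(prompt)
    snippets = retrieve(ktype, prompt)
    # Knowledge augmentation: prepend retrieved knowledge to the prompted input,
    # then let the text-to-text backbone generate the answer in natural language.
    augmented = f"knowledge ({ktype}): {' '.join(snippets)} input: {prompt}"
    input_ids = tokenizer(augmented, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(kic_generate("question: Can a sparrow fly? answer:"))
```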