Language models have been shown to perform better with an increase in scale on a wide variety of tasks via the in-context learning paradigm. In this paper, we investigate the hypothesis that the ability of a large language model to perform a task via in-context learning is not uniformly spread across all of its underlying components. Using a 66 billion parameter language model (OPT-66B) across a diverse set of 14 downstream tasks, we find this is indeed the case: $\sim$70% of attention heads and $\sim$20% of feed-forward networks can be removed with minimal decline in task performance. We find substantial overlap in the set of attention heads (un)important for in-context learning across tasks and across numbers of in-context examples. We also address our hypothesis through a task-agnostic lens, finding that a small set of attention heads in OPT-66B score highly on their ability to perform the primitive induction operations associated with in-context learning, namely, prefix matching and copying. These induction heads overlap with task-specific important heads, suggesting that induction heads are among the heads capable of more sophisticated behaviors associated with in-context learning. Overall, our study provides several insights indicating that large language models may be under-trained for in-context learning and opens up questions on how to pre-train language models to more effectively perform in-context learning.