Large language models (LMs) such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. The LM learns from these examples without being explicitly pretrained to learn. Thus, it is unclear what enables in-context learning. In this paper, we study how in-context learning can emerge when pretraining documents have long-range coherence. Here, the LM must infer a latent document-level concept to generate coherent next tokens during pretraining. At test time, in-context learning occurs when the LM also infers a shared latent concept between examples in a prompt. We prove when this occurs despite a distribution mismatch between prompts and pretraining data in a setting where the pretraining distribution is a mixture of HMMs. In contrast to messy large-scale datasets used to train LMs capable of in-context learning, we generate a small-scale synthetic dataset (GINC) where Transformers and LSTMs both exhibit in-context learning. Beyond the theory, experiments on GINC exhibit large-scale real-world phenomena including improved in-context performance with model scaling (despite the same pretraining loss), sensitivity to example order, and instances where zero-shot is better than few-shot in-context learning.
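To make the pretraining setting concrete, here is a minimal sketch of drawing documents from a mixture of HMMs in the spirit of the GINC setup described above. All parameters (vocabulary size, number of hidden states, number of concepts, document length) and the function names are illustrative assumptions, not the paper's actual configuration or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; not the GINC configuration.
VOCAB_SIZE = 10      # observed tokens
NUM_STATES = 5       # hidden states per HMM
NUM_CONCEPTS = 3     # mixture components (latent "concepts")
DOC_LENGTH = 20      # tokens per pretraining document


def random_stochastic(rows, cols):
    """Return a row-stochastic matrix with random entries."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)


# One HMM (initial, transition, and emission distributions) per latent concept.
concepts = [
    {
        "initial": random_stochastic(1, NUM_STATES)[0],
        "transition": random_stochastic(NUM_STATES, NUM_STATES),
        "emission": random_stochastic(NUM_STATES, VOCAB_SIZE),
    }
    for _ in range(NUM_CONCEPTS)
]


def sample_document(length=DOC_LENGTH):
    """Sample a document: draw a latent concept, then roll out its HMM."""
    hmm = concepts[rng.integers(NUM_CONCEPTS)]
    state = rng.choice(NUM_STATES, p=hmm["initial"])
    tokens = []
    for _ in range(length):
        tokens.append(int(rng.choice(VOCAB_SIZE, p=hmm["emission"][state])))
        state = rng.choice(NUM_STATES, p=hmm["transition"][state])
    return tokens


if __name__ == "__main__":
    # Each document is coherent only under its own latent concept,
    # which is what the LM must implicitly infer during pretraining.
    print(sample_document())
```

In this sketch, long-range coherence within a document comes from the single concept shared by all its tokens; in-context learning at test time corresponds to inferring that shared concept from the prompt examples.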