Large pretrained language models such as GPT-3 have the surprising ability to do in-context learning, where the model learns to perform a downstream task simply by conditioning on a prompt consisting of input-output examples. Without being explicitly pretrained to do so, the language model learns from these examples during its forward pass, with no parameter updates, even though the prompts are "out-of-distribution" relative to the pretraining data. Thus, it is unclear what mechanism enables in-context learning. In this paper, we study the role of the pretraining distribution in the emergence of in-context learning under a mathematical setting where the pretraining texts have long-range coherence. In this setting, language model pretraining requires inferring a latent document-level concept from the conditioning text in order to generate coherent next tokens. At test time, the same mechanism enables in-context learning by inferring the latent concept shared across prompt examples and applying it to make a prediction on the test example. Concretely, we prove that in-context learning arises implicitly via Bayesian inference of the latent concept when the pretraining distribution is a mixture of HMMs, and that this can occur despite the distribution mismatch between prompts and pretraining data. In contrast to the messy large-scale pretraining datasets that give rise to in-context learning in natural language, we generate a family of small-scale synthetic datasets (GINC) on which both Transformer and LSTM language models exhibit in-context learning. Beyond the theory, which focuses on the effect of the pretraining distribution, we empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
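To make the claimed mechanism concrete, the following is a minimal sketch of the implicit-Bayesian-inference view. The notation is introduced here for exposition and is not taken verbatim from the abstract: $\theta$ denotes the latent document-level concept, $S_n$ a prompt of $n$ input-output examples followed by a test input $x_{\text{test}}$, and $p(\theta)$ the prior over concepts induced by the pretraining distribution. The language model's next-token prediction can then be read as the posterior predictive

\[
p(y \mid S_n, x_{\text{test}}) \;=\; \int_{\theta} p(y \mid x_{\text{test}}, \theta)\, p(\theta \mid S_n, x_{\text{test}})\, d\theta .
\]

Under the mixture-of-HMMs pretraining distribution, the posterior $p(\theta \mid S_n, x_{\text{test}})$ concentrates on the concept shared by the prompt examples as $n$ grows, so the prediction approaches prediction under that shared concept; this is the sense in which in-context learning occurs implicitly via Bayesian inference of the latent concept.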