Large transformer-based models are able to perform in-context few-shot learning without being explicitly trained for it. This observation raises the question: what aspects of the training regime lead to this emergent behavior? Here, we show that this behavior is driven by the distributions of the training data itself. In-context learning emerges when the training data exhibits particular distributional properties such as burstiness (items appear in clusters rather than being uniformly distributed over time) and a large number of rarely occurring classes. In-context learning also emerges more strongly when item meanings or interpretations are dynamic rather than fixed. These properties are exemplified by natural language, but are also inherent to naturalistic data in a wide range of other domains. They also depart significantly from the uniform, i.i.d. training distributions typically used for standard supervised learning. In our initial experiments, we found that in-context learning traded off against more conventional weight-based learning, and models were unable to achieve both simultaneously. However, our later experiments uncovered that the two modes of learning could co-exist in a single model when it was trained on data following a skewed Zipfian distribution -- another common property of naturalistic data, including language. In further experiments, we found that naturalistic data distributions were only able to elicit in-context learning in transformers, and not in recurrent models. In sum, our findings indicate how the transformer architecture works together with particular properties of the training data to drive the intriguing emergent in-context learning behavior of large language models, and how future work might encourage both in-context and in-weights learning in domains beyond language.
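To make these distributional properties concrete, the following minimal NumPy sketch illustrates one way such training data could be generated: class frequencies follow a skewed Zipfian distribution over many (mostly rare) classes, and most sequences are "bursty", with the query's class recurring in the context rather than all context items being drawn i.i.d. The constants (NUM_CLASSES, CONTEXT_LEN, ZIPF_EXPONENT, P_BURSTY) and the specific construction are illustrative assumptions, not the paper's exact data pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 1600      # assumption: many classes, most of them rare
CONTEXT_LEN = 8         # (item, label) pairs shown before the query
ZIPF_EXPONENT = 1.0     # skew of the rank-frequency distribution
P_BURSTY = 0.9          # fraction of sequences built in the bursty regime

# Zipfian (rank-frequency) distribution over classes.
ranks = np.arange(1, NUM_CLASSES + 1)
class_probs = ranks ** (-ZIPF_EXPONENT)
class_probs /= class_probs.sum()

def sample_sequence():
    """Return (context_classes, query_class) for one training sequence."""
    query_class = rng.choice(NUM_CLASSES, p=class_probs)
    if rng.random() < P_BURSTY:
        # Bursty: the query's class recurs several times in the context,
        # alongside a "distractor" class, instead of i.i.d. draws.
        distractor = rng.choice(NUM_CLASSES, p=class_probs)
        context = np.concatenate([
            np.full(CONTEXT_LEN // 2, query_class),
            np.full(CONTEXT_LEN - CONTEXT_LEN // 2, distractor),
        ])
        rng.shuffle(context)
    else:
        # Non-bursty baseline: context classes drawn i.i.d. from the same Zipfian.
        context = rng.choice(NUM_CLASSES, size=CONTEXT_LEN, p=class_probs)
    return context, query_class

context, query = sample_sequence()
print("context classes:", context, "query class:", query)
```

In this sketch, solving the bursty sequences rewards using the context (in-context learning), while the Zipfian skew ensures a handful of frequent classes can still be memorized in the weights (in-weights learning).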