The internet contains a wealth of knowledge -- from the birthdays of historical figures to tutorials on how to code -- all of which may be learned by language models. However, there is huge variability in how often a given piece of information appears on the web. In this paper, we study the relationship between the knowledge memorized by large language models and the information in their pre-training datasets. In particular, we show that a language model's ability to answer a fact-based question relates to how many documents associated with that question were seen during pre-training. We identify these relevant documents by entity linking pre-training datasets and counting documents that contain the same entities as a given question-answer pair. Our results demonstrate strong correlational and causal relationships between accuracy and relevant document count for numerous question answering datasets (e.g., TriviaQA), pre-training corpora (e.g., ROOTS), and model sizes (e.g., 176B parameters). Moreover, we find that while larger models are better at learning long-tail knowledge, we estimate that today's models must be scaled by many orders of magnitude to reach competitive QA performance on questions with little support in the pre-training data. Finally, we show that retrieval-augmentation can reduce the dependence on relevant document count, presenting a promising approach for capturing the long tail.
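As a rough illustration of the counting step described above (not the authors' actual pipeline), the hypothetical sketch below tallies pre-training documents whose linked entities intersect both the question's entities and the answer's entities. The entity sets, the entity linker producing them, and the document iterator are all assumptions for the sake of the example.

```python
from typing import Iterable, Set, Tuple

def count_relevant_docs(
    qa_entities: Tuple[Set[str], Set[str]],
    linked_docs: Iterable[Set[str]],
) -> int:
    """Count documents whose linked entities include at least one
    question entity and at least one answer entity (a simplified,
    hypothetical version of the relevant-document count)."""
    question_entities, answer_entities = qa_entities
    count = 0
    for doc_entities in linked_docs:
        # A document is "relevant" to a QA pair if it mentions
        # entities from both the question and the answer.
        if doc_entities & question_entities and doc_entities & answer_entities:
            count += 1
    return count

# Hypothetical usage: entity sets would come from running an entity
# linker over the QA pair and over each pre-training document.
docs = [
    {"George_Washington", "United_States", "1732"},
    {"George_Washington", "Mount_Vernon"},
]
qa = ({"George_Washington"}, {"1732"})
print(count_relevant_docs(qa, docs))  # -> 1
```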