Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context and using this representation to predict the next word. Currently, most LMs calculate these representations through a neural network consuming the immediately preceding context. Recently, however, retrieval-augmented LMs have been shown to improve over standard neural LMs by accessing information retrieved from a large datastore, in addition to their standard, parametric next-word prediction. In this paper, we set out to understand why retrieval-augmented language models, and specifically k-nearest neighbor language models (kNN-LMs), perform better than standard parametric LMs, even when the k-nearest neighbor component retrieves examples from the same training set that the LM was originally trained on. To this end, we perform a careful analysis of the various dimensions over which kNN-LM diverges from standard LMs, and investigate these dimensions one by one. Empirically, we identify three main reasons why kNN-LM performs better than standard LMs: using a different input representation for predicting the next token, approximate kNN search, and the importance of softmax temperature for the kNN distribution. Further, we incorporate these insights into the model architecture or the training procedure of the standard parametric LM, improving its results without the need for an explicit retrieval component. The code is available at https://github.com/frankxu2004/knnlm-why.
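To make the interpolation concrete, below is a minimal sketch of how a kNN-LM combines the parametric LM distribution with a retrieval distribution built from nearest-neighbor distances, including the softmax temperature highlighted in the abstract. The function name, the default values of `lmbda` and `temperature`, and the tensor shapes are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def knn_lm_interpolate(lm_logits, knn_distances, knn_token_ids, vocab_size,
                       lmbda=0.25, temperature=1.0):
    """Illustrative kNN-LM interpolation sketch.

    lm_logits:      (vocab_size,) parametric LM logits for the next token
    knn_distances:  (k,) distances from the query context to the retrieved keys
    knn_token_ids:  (k,) target token id stored with each retrieved key
    lmbda:          interpolation weight for the kNN distribution
    temperature:    softmax temperature applied to the negative distances
    """
    # Parametric next-token distribution from the base LM.
    p_lm = F.softmax(lm_logits, dim=-1)

    # Retrieval distribution: softmax over negative distances, scaled by temperature.
    knn_weights = F.softmax(-knn_distances / temperature, dim=-1)

    # Scatter neighbor weights onto the vocabulary; neighbors sharing a token add up.
    p_knn = torch.zeros(vocab_size)
    p_knn.scatter_add_(0, knn_token_ids, knn_weights)

    # Final next-token distribution is a linear interpolation of the two.
    return lmbda * p_knn + (1.0 - lmbda) * p_lm
```

In this sketch, lowering `temperature` sharpens the retrieval distribution around the closest neighbors, while raising it flattens the distribution; the abstract identifies tuning this temperature as one of the main sources of kNN-LM's improvement.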