Retrieval-enhanced language models (LMs), which condition their predictions on text retrieved from large external datastores, have recently shown significant perplexity improvements compared to standard LMs. One such approach, the $k$NN-LM, interpolates any existing LM's predictions with the output of a $k$-nearest neighbors model and requires no additional training. In this paper, we explore the importance of lexical and semantic matching in the context of items retrieved by the $k$NN-LM. We find two trends: (1) the presence of large overlapping $n$-grams between the datastore and evaluation set is an important factor in strong performance, even when the datastore is derived from the training data; and (2) the $k$NN-LM is most beneficial when retrieved items have high semantic similarity with the query. Based on our analysis, we define a new formulation of the $k$NN-LM that uses retrieval quality to assign the interpolation coefficient. We empirically measure the effectiveness of our approach on two English language modeling datasets, Wikitext-103 and PG-19. Our re-formulation of the $k$NN-LM is beneficial in both cases, and leads to a nearly 4% improvement in perplexity on the Wikitext-103 test set.
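For context, the standard $k$NN-LM combines the two distributions with a fixed interpolation coefficient $\lambda$; the sketch below shows this interpolation with the coefficient made query-dependent, $\lambda(q)$, driven by retrieval quality as the abstract describes. The notation is illustrative and the concrete parameterization of $\lambda(q)$ is an assumption here, not the paper's exact formulation.

$$
p(y \mid q) = \lambda(q)\, p_{k\mathrm{NN}}(y \mid q) + \bigl(1 - \lambda(q)\bigr)\, p_{\mathrm{LM}}(y \mid q),
$$

where $q$ is the query context, $p_{\mathrm{LM}}$ is the base LM's next-token distribution, $p_{k\mathrm{NN}}$ is the distribution induced by the $k$ retrieved nearest neighbors, and the original $k$NN-LM corresponds to a constant $\lambda(q) \equiv \lambda$.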