We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25$\times$ fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
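To make the chunked cross-attention mechanism mentioned above concrete, here is a minimal NumPy sketch under simplifying assumptions: a single attention head, no relative positional encodings, random placeholder projection matrices, and the paper's causal shift approximated by letting each chunk attend only to the neighbours retrieved for the previous chunk. The shape names (`m`, `k`, `r`, `d`) follow the paper's notation; everything else is illustrative, not the authors' implementation.

```python
# Simplified sketch of chunked cross-attention (single head, no relative
# positional encodings; causality approximated by attending only to the
# neighbours retrieved for the previous chunk). Not the paper's exact code.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chunked_cross_attention(H, E, Wq, Wk, Wv):
    """
    H: (T, d)               intermediate decoder activations, T = n_chunks * m
    E: (n_chunks, k, r, d)  encoded retrieved neighbours for each chunk
    Returns: (T, d)         activations updated with retrieved information
    """
    T, d = H.shape
    n_chunks, k, r, _ = E.shape
    m = T // n_chunks
    out = H.copy()
    for u in range(1, n_chunks):            # chunk 0 has no preceding retrieval
        chunk = H[u * m:(u + 1) * m]         # (m, d) queries from the current chunk
        neigh = E[u - 1].reshape(k * r, d)   # keys/values: neighbours of chunk u-1
        q = chunk @ Wq                       # (m, d)
        kk = neigh @ Wk                      # (k*r, d)
        v = neigh @ Wv                       # (k*r, d)
        att = softmax(q @ kk.T / np.sqrt(d)) # (m, k*r) attention weights
        out[u * m:(u + 1) * m] += att @ v    # residual update with retrieved content
    return out

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
d, m, n_chunks, k, r = 16, 4, 3, 2, 8
H = rng.normal(size=(n_chunks * m, d))
E = rng.normal(size=(n_chunks, k, r, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
print(chunked_cross_attention(H, E, Wq, Wk, Wv).shape)  # (12, 16)
```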