We introduce ART, a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. ART, in contrast, only requires access to unpaired inputs and outputs (e.g., questions and potential answer documents). It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question. Training for retrieval based on question reconstruction enables effective unsupervised learning of both document and question encoders, which can later be incorporated into complete Open QA systems without any further finetuning. Extensive experiments demonstrate that ART obtains state-of-the-art results on multiple QA retrieval benchmarks with only generic initialization from a pre-trained language model, removing the need for labeled data and task-specific losses.
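To make the two-step autoencoding scheme concrete, the following is a minimal sketch of the question-reconstruction training signal in PyTorch. It assumes a dual-encoder retriever and a frozen pretrained seq2seq language model exposed here as a hypothetical `lm_score` function returning log P(question | passage); the names and the exact loss formulation are illustrative, not the authors' released implementation.

```python
# Minimal sketch of ART-style training (illustrative, not the official code).
# Assumes: question_emb and passage_embs come from a dual-encoder retriever,
# and lm_score(question_ids, passage_text) is a frozen pretrained LM scorer
# returning log P(question | passage).
import torch
import torch.nn.functional as F

def art_loss(question_emb, passage_embs, question_ids, passage_texts, lm_score, tau=1.0):
    """question_emb: [d]; passage_embs: [k, d] for the top-k retrieved passages."""
    # (1) Retriever distribution over the retrieved passages (dot-product scores).
    retriever_logits = passage_embs @ question_emb            # [k]
    log_p_retriever = F.log_softmax(retriever_logits, dim=-1)

    # (2) Target distribution from the frozen LM: how well each retrieved passage
    #     reconstructs the original question, normalized over the top-k set.
    with torch.no_grad():
        recon_logprobs = torch.stack(
            [lm_score(question_ids, p) for p in passage_texts]  # log P(q | passage)
        )                                                        # [k]
        target = F.softmax(recon_logprobs / tau, dim=-1)

    # Train the retriever to match the reconstruction-based target distribution;
    # no labeled question-passage pairs are required.
    return F.kl_div(log_p_retriever, target, reduction="sum")
```

In this sketch the gradient flows only through the retriever's scores, while the language-model scorer stays frozen, which mirrors the abstract's claim that the question and document encoders are learned without labeled data or task-specific supervision.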