We introduce ART, a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. ART, in contrast, only requires access to unpaired inputs and outputs (e.g., questions and potential answer documents). It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question. Training for retrieval based on question reconstruction enables effective unsupervised learning of both document and question encoders, which can later be incorporated into complete Open QA systems without any further finetuning. Extensive experiments demonstrate that ART obtains state-of-the-art results on multiple QA retrieval benchmarks with only generic initialization from a pre-trained language model, removing the need for labeled data and task-specific losses.
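To make the autoencoding objective concrete, the following is a minimal sketch of one plausible training step in PyTorch. All interfaces here are illustrative assumptions, not the authors' actual code: `question_encoder` and `passage_encoder` stand in for the dual encoders, and `reconstruction_lm.log_likelihood(question, passage)` stands in for a frozen generative language model scoring log P(question | passage). The sketch trains the retriever so its distribution over retrieved passages matches the distribution implied by how well each passage reconstructs the question, requiring no labeled question-passage pairs.

```python
import torch
import torch.nn.functional as F

def art_training_step(question, passages, question_encoder, passage_encoder,
                      reconstruction_lm, temperature=1.0):
    """One ART-style training step (sketch; names are hypothetical)."""
    # (1) Retrieve: score the question against candidate passages
    # with the trainable dual encoders.
    q_vec = question_encoder(question)                            # [d]
    p_vecs = torch.stack([passage_encoder(p) for p in passages])  # [k, d]
    retriever_logits = p_vecs @ q_vec                             # [k]

    # (2) Reconstruct: ask a frozen LM how well each passage
    # explains (i.e., reconstructs) the original question.
    with torch.no_grad():
        recon_scores = torch.tensor([
            reconstruction_lm.log_likelihood(question, p) for p in passages
        ])                                                        # [k]

    # Soft relevance targets derived from reconstruction likelihoods.
    teacher = F.softmax(recon_scores / temperature, dim=0)
    student = F.log_softmax(retriever_logits, dim=0)

    # Train both encoders so the retrieval distribution matches the
    # reconstruction-based distribution; no gradient flows into the LM.
    return F.kl_div(student, teacher, reduction="sum")
```

Because the supervision signal comes entirely from the frozen language model, only questions and an unpaired document collection are needed, which is what allows the trained encoders to be dropped into an Open QA pipeline without further finetuning.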