Recent work on training neural retrievers for open-domain question answering (OpenQA) has employed both supervised and unsupervised approaches. However, it remains unclear how unsupervised and supervised methods can be used most effectively for neural retrievers. In this work, we systematically study retriever pre-training. We first propose an approach of unsupervised pre-training with the Inverse Cloze Task and masked salient spans, followed by supervised fine-tuning using question-context pairs. This approach yields absolute gains of 2+ points over the previous best result in top-20 retrieval accuracy on the Natural Questions and TriviaQA datasets. We also explore two approaches for end-to-end supervised training of the reader and retriever components in OpenQA models. In the first approach, the reader considers each retrieved document separately, while in the second, the reader considers all the retrieved documents together. Our experiments demonstrate the effectiveness of these approaches, as we obtain new state-of-the-art results. On the Natural Questions dataset, we obtain a top-20 retrieval accuracy of 84, an improvement of 5 points over the recent DPR model. In addition, we achieve strong results on answer extraction, outperforming recent models such as REALM and RAG by 3+ points. We further scale up end-to-end training to large models and show consistent gains in performance over smaller models.