Document retrieval is a core component of many knowledge-intensive natural language processing tasks, such as fact verification and question answering. Sources of textual knowledge, such as Wikipedia articles, condition the answers that models generate. Recent advances in retrieval use sequence-to-sequence models to incrementally predict the title of the appropriate Wikipedia page given a query. However, this method requires supervision in the form of human annotations labeling which Wikipedia pages contain the appropriate context. This paper introduces a distant-supervision method that requires no annotation to train autoregressive retrievers, which attain competitive R-Precision and Recall in a zero-shot setting. Furthermore, we show that with task-specific supervised fine-tuning, autoregressive retrieval performance on two Wikipedia-based fact verification tasks can approach or even exceed fully supervised performance while using less than $1/4$ of the annotated data, indicating possible directions for data-efficient autoregressive retrieval.
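To make the retrieval formulation concrete, the sketch below shows how a sequence-to-sequence model can decode a Wikipedia page title token by token from a query, with beam search producing a ranked list of candidate titles. This is a minimal illustration under stated assumptions, not the paper's implementation: the `facebook/bart-large` checkpoint is an off-the-shelf stand-in that has not been trained for title prediction, and the constrained decoding over the set of valid Wikipedia titles that such retrievers typically use is omitted.

```python
# Minimal sketch of autoregressive title retrieval (illustrative only).
# Assumptions: an untrained off-the-shelf BART checkpoint stands in for a
# retriever; real systems constrain decoding to valid Wikipedia titles.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

query = "Who wrote the novel Nineteen Eighty-Four?"
inputs = tokenizer(query, return_tensors="pt")

# Beam search decodes titles token by token and returns the top-k beams,
# giving a ranked candidate list that metrics like R-Precision evaluate.
outputs = model.generate(
    **inputs,
    num_beams=5,
    num_return_sequences=5,
    max_length=32,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```

Because the model emits a title string rather than scoring every document, retrieval cost is tied to decoding length, not index size; the ranked beams can then be matched against Wikipedia pages for downstream fact verification or question answering.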