This paper studies multi-task training of retrieval-augmented generation models for knowledge-intensive tasks. We propose to clean the training set by exploiting a distinctive property of knowledge-intensive generation: the connection of query-answer pairs to items in the knowledge base. We filter training examples by thresholding the confidence of their relevance labels, i.e., whether a pair is answerable from the knowledge base or not. We train a single Fusion-in-Decoder (FiD) generator on seven combined tasks of the KILT benchmark. The experimental results suggest that our simple yet effective approach substantially improves over competitive baselines on two strongly imbalanced tasks, and shows either smaller improvements or no significant regression on the remaining tasks. Furthermore, we demonstrate that our multi-task training with relevance label sampling scales well with increased model capacity and achieves state-of-the-art results on five out of seven KILT tasks.
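The filtering step described above can be sketched as follows; this is a minimal illustration, not the authors' released code, and it assumes each training example carries a per-example relevance-confidence score (the field name `relevance_confidence` and the threshold value are hypothetical):

```python
from typing import Dict, Iterable, List


def filter_by_relevance_confidence(
    examples: Iterable[Dict],
    threshold: float = 0.5,  # assumed cutoff on the relevance-label confidence
) -> List[Dict]:
    """Keep only training examples whose query-answer pair is judged
    answerable from the knowledge base with sufficient confidence."""
    kept = []
    for ex in examples:
        # `relevance_confidence` is a hypothetical score, e.g. the confidence
        # that some retrieved knowledge-base item supports the gold answer.
        if ex.get("relevance_confidence", 0.0) >= threshold:
            kept.append(ex)
    return kept
```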