In this paper, we introduce a new NLP task: generating short factual articles with references for queries by mining supporting evidence from the Web. In this task, called WebBrain, the ultimate goal is to generate a fluent, informative, and factually correct short article (e.g., a Wikipedia article) for a factual query unseen in Wikipedia. To enable experiments on WebBrain, we construct a large-scale dataset, WebBrain-Raw, by extracting English Wikipedia articles and their crawlable Wikipedia references. WebBrain-Raw is ten times larger than the previously largest peer dataset and can greatly benefit the research community. From WebBrain-Raw, we construct two task-specific datasets, WebBrain-R and WebBrain-G, which are used to train an in-domain retriever and a generator, respectively. In addition, we empirically analyze the performance of current state-of-the-art NLP techniques on WebBrain and introduce a new framework, ReGen, which enhances the factual correctness of generation through improved evidence retrieval and task-specific pre-training for generation. Experimental results show that ReGen outperforms all baselines in both automatic and human evaluations.