We study the task of generating, from Wikipedia articles, question-answer pairs that cover content beyond a single sentence. We propose a neural network approach that incorporates coreference knowledge via a novel gating mechanism. Compared to models that take into account only sentence-level information (Heilman and Smith, 2010; Du et al., 2017; Zhou et al., 2017), we find that the linguistic knowledge introduced by the coreference representation aids question generation significantly, producing models that outperform the current state of the art. We apply our system (composed of an answer span extraction system and the passage-level QG system) to the 10,000 top-ranking Wikipedia articles and create a corpus of over one million question-answer pairs. We also provide a qualitative analysis of this large-scale generated corpus.