This paper presents a comprehensive analysis of Wikipedia-based corpora for several question answering tasks. Four recent corpora, WikiQA, SelQA, SQuAD, and InfoQA, are collected and first analyzed intrinsically by contextual similarities, question types, and answer categories. These corpora are then analyzed extrinsically on three question answering tasks: answer retrieval, answer selection, and answer triggering. An indexing-based method for creating a silver-standard dataset for answer retrieval over the entire Wikipedia is also presented. Our analysis shows the uniqueness of these corpora and suggests how they can be better used for learning statistical question answering models.