This paper surveys 60 English Machine Reading Comprehension datasets, with a view to providing a convenient resource for other researchers interested in this problem. We categorize the datasets according to their question and answer form and compare them across various dimensions including size, vocabulary, data source, method of creation, human performance level, and first question word. Our analysis reveals that Wikipedia is by far the most common data source and that there is a relative lack of why, when, and where questions across datasets.