A major challenge in research on non-English machine reading for question answering (QA) is the lack of annotated datasets. In this paper, we present GermanQuAD, a dataset of 13,722 extractive question/answer pairs. To improve the reproducibility of the dataset creation approach and foster QA research on other languages, we summarize lessons learned and evaluate reformulation of question/answer pairs as a way to speed up the annotation process. An extractive QA model trained on GermanQuAD significantly outperforms multilingual models and also shows that machine-translated training data cannot fully substitute for hand-annotated training data in the target language. Finally, we demonstrate the wide range of applications of GermanQuAD by adapting it to GermanDPR, a training dataset for dense passage retrieval (DPR), and we train and evaluate the first non-English DPR model.