With the rise of large-scale pre-trained language models, open-domain question answering (ODQA) has become an important research topic in NLP. Building on the popular pre-training/fine-tuning paradigm, we posit that an additional in-domain pre-training stage using a large-scale, natural, and diverse question-answering (QA) dataset can be beneficial for ODQA. In this paper, we therefore propose a novel QA dataset based on the Common Crawl project. Using readily available schema.org annotations, we extract around 130 million multilingual question-answer pairs, including about 60 million English data points. With this unprecedented number of natural QA pairs, we pre-train popular language models to demonstrate the potential of large-scale in-domain pre-training for the task of question answering. In our experiments, we find that pre-training question-answering models on our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low-resource, and fine-tuned settings across multiple tasks, models, and benchmarks.
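The abstract only states that the QA pairs are extracted from schema.org annotations, without describing the pipeline. The following is a minimal illustrative sketch, not the authors' actual extraction code, assuming the annotations appear as JSON-LD blocks (schema.org FAQPage/QAPage markup with Question and Answer entities); the function name and handling of edge cases are assumptions for illustration.

```python
# Sketch: pull (question, answer) text pairs from schema.org JSON-LD markup.
# Assumes annotations are embedded as <script type="application/ld+json">;
# the real CCQA pipeline over Common Crawl may differ in scope and detail.
import json
from bs4 import BeautifulSoup

def extract_qa_pairs(html: str):
    """Return (question, answer) text pairs found in JSON-LD schema.org markup."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        # A page may embed a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            # FAQPage/QAPage nest Question objects under "mainEntity";
            # a standalone Question object is handled via the fallback.
            questions = item.get("mainEntity", item) if isinstance(item, dict) else item
            if not isinstance(questions, list):
                questions = [questions]
            for q in questions:
                if not isinstance(q, dict) or q.get("@type") != "Question":
                    continue
                q_text = q.get("name") or q.get("text", "")
                answers = q.get("acceptedAnswer") or q.get("suggestedAnswer") or []
                if isinstance(answers, dict):
                    answers = [answers]
                for a in answers:
                    a_text = a.get("text", "") if isinstance(a, dict) else ""
                    if q_text and a_text:
                        pairs.append((q_text, a_text))
    return pairs
```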