The general QA field has been developing its methodology with the Stanford Question Answering Dataset (SQuAD) as the major benchmark. However, compiling factual questions requires time- and labour-intensive annotation, limiting the potential size of training data. We present the WikiOmnia dataset, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generative pipeline. The dataset includes every available article from Wikipedia for the Russian language. The WikiOmnia pipeline is available open-source and has also been tested for creating SQuAD-formatted QA on other domains, such as news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole of Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and data cleaned with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).