WikiOmnia: generative QA corpus on the whole Russian Wikipedia - Zhuanzhi Papers


Tags: Question answering · Wikipedia · Automator · Datasets · Lecture notes

April 17, 2022

WikiOmnia: generative QA corpus on the whole Russian Wikipedia


Dina Pisarevskaya,Tatiana Shavrina

The general QA field has developed its methodology with reference to the Stanford Question Answering Dataset (SQuAD) as a significant benchmark. However, compiling factual questions requires time- and labour-consuming annotation, which limits the potential size of the training data. We present the WikiOmnia dataset, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generative pipeline. The dataset includes every available article from the Russian-language Wikipedia. The WikiOmnia pipeline is available as open source and has also been tested for creating SQuAD-formatted QA on other domains, such as news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).
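The abstract refers to "SQuAD-formatted QA" without showing the layout. As a rough illustration only, the sketch below packs one QA pair and its paragraph into the nested structure used by SQuAD v1.1 (`data` > `paragraphs` > `qas` > `answers`). The helper name `to_squad_entry`, the example text, and the ID are hypothetical and are not taken from the WikiOmnia release; the substring check is only a minimal sanity test and should not be read as the paper's "strict automatic verification".

```python
import json


def to_squad_entry(title, context, question, answer, qa_id):
    """Build one SQuAD v1.1-style article entry from a single QA pair.

    Returns None when the answer is not a literal span of the context,
    because SQuAD answers are located by character offset.
    (Hypothetical helper, not part of the WikiOmnia code.)
    """
    start = context.find(answer)  # character offset of the first match, or -1
    if start == -1:
        return None
    return {
        "title": title,
        "paragraphs": [
            {
                "context": context,
                "qas": [
                    {
                        "id": qa_id,
                        "question": question,
                        "answers": [{"text": answer, "answer_start": start}],
                    }
                ],
            }
        ],
    }


# Made-up example data, for illustration only.
context = "Moscow is the capital of Russia."
entry = to_squad_entry(
    "Moscow", context, "What is the capital of Russia?", "Moscow", "example-0001"
)
dataset = {"version": "1.1", "data": [entry]}
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```

A pair whose answer does not occur verbatim in the paragraph yields `None` here, so a caller would need to skip it before building the `data` list.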



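Question answering

Question answering (QA) is the task of using computers to automatically answer questions posed by users in order to meet their information needs. Unlike existing search engines, a QA system is a more advanced form of information service: instead of a list of documents ranked by keyword matching, it returns a precise natural-language answer. In recent years, with the rapid development of artificial intelligence, QA has become a widely followed research direction with broad prospects.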
