News article revision histories have the potential to give us novel insights across varied fields of linguistics and the social sciences. In this work, we present, to our knowledge, the first publicly available dataset of news article revision histories, \textit{NewsEdits}. Our dataset is multilingual; it contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources based in three countries. Across version pairs, we count 10.9 million added sentences, 8.9 million changed sentences, and 6.8 million removed sentences. Within the changed sentences, we derive 72 million atomic edits. \textit{NewsEdits} is, to our knowledge, the largest corpus of revision histories in any domain.