Automatic text summarization has been studied across a variety of domains and languages. However, this is not the case for Russian. To address this gap, we present Gazeta, the first dataset for summarization of Russian news. We describe the properties of this dataset and benchmark several extractive and abstractive models. We demonstrate that the dataset poses a valid task for Russian text summarization methods. Additionally, we show that the pretrained mBART model is useful for Russian text summarization.