This preprint describes work in progress on LR-Sum, a new permissively licensed dataset created to enable further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced. We describe our process for extracting and filtering the dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), making it one of the most openly licensed multilingual summarization datasets. We describe how we plan to use the data for modeling experiments and discuss limitations of the dataset.