LR-Sum: Summarization for Less-Resourced Languages

This preprint describes work in progress on LR-Sum, a new permissively licensed dataset created to enable further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced. We describe our process for extracting and filtering the dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), making it one of the most openly licensed multilingual summarization datasets. We describe how we plan to use the data for modeling experiments and discuss limitations of the dataset.
