Recent summarization models contain millions of parameters, and their performance depends heavily on the abundance of training data. While most existing summarization corpora contain on the order of thousands to one million instances, the construction of large-scale summarization datasets with several million instances has yet to be explored. In practice, more training data helps a model generalize the patterns it learns to unseen data. In this paper, we introduce TLDR9+, a large-scale summarization dataset containing over 9 million training instances extracted from the Reddit discussion forum (https://github.com/sajastu/reddit_collector). The dataset is gathered specifically for extreme summarization (i.e., generating a one-sentence summary with high compression and abstraction) and is more than twice as large as the previously proposed dataset. Going one step further, with the help of human annotations, we distill a more fine-grained dataset by sampling high-quality (HQ) instances from TLDR9+, which we call TLDRHQ. We further benchmark different state-of-the-art summarization models on our proposed datasets.