In recent years, transformer-based models have led to significant advances in language modelling for natural language processing. However, they require vast amounts of data for (pre-)training, and corpora in languages other than English remain scarce. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings: they are either too small in comparison with other languages, or they suffer from low quality derived from sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawled corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together preserve the integrity of both document and paragraph boundaries. Additionally, we retain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license and is available on HuggingFace.