This paper presents the Spanish RoBERTa-base and RoBERTa-large models, together with their performance evaluations. Both models were pre-trained on the largest Spanish corpus known to date, a total of 570 GB of clean and deduplicated text processed for this work and compiled from the web crawls performed by the National Library of Spain between 2009 and 2019. We extended the existing evaluation datasets with an extractive Question Answering dataset, and our models outperform the existing Spanish models across tasks and settings.