Since the appearance of BERT, Transformer-based language models and transfer learning have become the state of the art for Natural Language Understanding tasks. Recently, several works have focused on pre-training specially-crafted models for particular domains, such as scientific papers, medical documents, and user-generated text. These domain-specific models have been shown to improve performance significantly on most tasks; however, such models are not widely available for languages other than English. In this work, we present RoBERTuito, a pre-trained language model for user-generated text in Spanish, trained on over 500 million tweets. Experiments on a benchmark of tasks involving user-generated text show that RoBERTuito outperforms other pre-trained language models in Spanish. In addition, our model achieves top results on some English-Spanish tasks of the Linguistic Code-Switching Evaluation benchmark (LinCE) and performs competitively against monolingual models on English tasks. To facilitate further research, we make RoBERTuito publicly available on the HuggingFace model hub, together with the dataset used to pre-train it.
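Since the abstract points readers to the HuggingFace model hub, the following is a minimal sketch of how the model could be loaded with the transformers library. The hub identifier used below is an assumption for illustration and is not stated in the abstract; consult the model hub page for the exact name.

```python
# Minimal sketch: loading RoBERTuito from the HuggingFace hub via transformers.
# The hub identifier below is assumed for illustration; check the hub for the actual name.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "pysentimiento/robertuito-base-uncased"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Encode a short Spanish, tweet-like input and run a forward pass.
inputs = tokenizer("esto es un ejemplo de texto en español", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
```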