Since the appearance of BERT, Transformer-based language models combined with transfer learning have defined the state of the art for Natural Language Understanding tasks. Recently, several works have focused on pre-training specially crafted models for particular domains, such as scientific papers and medical documents. In this work, we present RoBERTuito, a pre-trained language model for user-generated content in Spanish, trained on 500 million Spanish tweets. Experiments on a benchmark of four tasks involving user-generated text show that RoBERTuito outperforms other pre-trained language models for Spanish. To facilitate further research, we make RoBERTuito publicly available at the HuggingFace model hub.
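As a pointer for readers, the minimal sketch below shows how a model published on the HuggingFace hub can be loaded with the transformers library for masked-language-model inference; the model identifier pysentimiento/robertuito-base-uncased is an assumption based on the public hub listing, not something stated in this abstract.

```python
# Minimal sketch: loading RoBERTuito from the HuggingFace hub.
# The model identifier below is an assumption (taken from the public
# hub listing); adjust it if the hosted name differs.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "pysentimiento/robertuito-base-uncased"  # assumed hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Run a forward pass on a short Spanish example tweet.
text = "este es un tweet de ejemplo"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence length, vocab size)
```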