Given the impact of language models on the field of Natural Language Processing, a number of Spanish encoder-only masked language models (a.k.a. BERTs) have been trained and released. These models were developed either within large projects using very large private corpora, or through smaller-scale academic efforts leveraging freely available data. In this paper we present a comprehensive head-to-head comparison of language models for Spanish with the following results: (i) previously ignored multilingual models from large companies fare better than monolingual models, substantially changing the evaluation landscape of language models in Spanish; (ii) results across the monolingual models are not conclusive, with supposedly smaller and inferior models performing competitively. Based on these empirical results, we argue for the need for more research to understand the factors underlying them. In this sense, the effects of corpus size, quality and pre-training techniques need to be further investigated in order to obtain Spanish monolingual models significantly better than the multilingual ones released by large private companies, especially in the face of rapid ongoing progress in the field. The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem which requires marrying resources (monetary and/or computational) with the best research expertise and practice.