Due to the success of pre-trained language models, versions for languages other than English have been released in recent years. This fact implies the need for resources to evaluate these models. In the case of Spanish, there are few ways to systematically assess the models' quality. In this paper, we narrow the gap by building two evaluation benchmarks. Inspired by previous work (Conneau and Kiela, 2018; Chen et al., 2019), we introduce Spanish SentEval and Spanish DiscoEval, aiming to assess the capabilities of stand-alone and discourse-aware sentence representations, respectively. Our benchmarks include a considerable number of pre-existing and newly constructed datasets that address different tasks from various domains. In addition, we evaluate and analyze the most recent pre-trained Spanish language models to exhibit their capabilities and limitations. As an example, we discover that for the discourse evaluation tasks, mBERT, a language model trained on multiple languages, usually provides a richer latent representation than models trained only on Spanish documents. We hope our contribution will motivate a fairer, more comparable, and less cumbersome way to evaluate future Spanish language models.