This paper presents monolingual BERT models for Galician. We follow the recent trend showing that it is feasible to build robust monolingual BERT models even for relatively low-resource languages, and that such models can perform better than the well-known official multilingual BERT (mBERT). More specifically, we release two monolingual Galician BERT models, built with 6 and 12 transformer layers respectively, and trained with limited resources (~45 million tokens on a single 24GB GPU). We then provide an exhaustive evaluation on a number of tasks, namely POS-tagging, dependency parsing and named entity recognition. For this purpose, all these tasks are cast in a pure sequence labeling setup, so that BERT can be run without any additional layers on top of it (we only use an output classification layer to map the contextualized representations into the predicted labels). The experiments show that our models, especially the 12-layer one, outperform the results of mBERT on most tasks.
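To make the sequence labeling setup concrete, the following is a minimal sketch (not the authors' released code) of a BERT encoder with only a linear output classification layer mapping each contextualized token representation to a label, using the HuggingFace transformers library. The checkpoint name, label count, and example sentence are placeholders for illustration; the mBERT baseline checkpoint shown ("bert-base-multilingual-cased") would be swapped for the released Galician models.

```python
# Sketch of the pure sequence-labeling setup described in the abstract:
# a pretrained BERT encoder plus a single linear classification layer,
# with no additional task-specific layers on top.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # placeholder; use the Galician checkpoint
NUM_LABELS = 17                              # placeholder, e.g. UPOS tags for POS-tagging

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)


class BertSequenceLabeler(nn.Module):
    """BERT encoder followed by one output classification layer per token."""

    def __init__(self, encoder, num_labels):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Contextualized representations from the last transformer layer.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Map each token representation directly to label scores.
        return self.classifier(hidden)  # shape: (batch, seq_len, num_labels)


model = BertSequenceLabeler(encoder, NUM_LABELS)

batch = tokenizer(["Isto é unha proba"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
predictions = logits.argmax(dim=-1)  # one predicted label per subword token
```

Tasks such as POS-tagging and named entity recognition fit this setup directly (one label per token), while dependency parsing can be handled by encoding each dependency arc as a per-token label, so the same architecture serves all three evaluations.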