The BERT family of neural language models has become highly popular due to its ability to provide rich, context-sensitive token encodings of text sequences that generalise well to many Natural Language Processing tasks. Over 120 monolingual BERT models covering over 50 languages have been released, as well as a multilingual model trained on 104 languages. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size, and the choice of subword tokenisation model affect downstream performance. We release gaBERT and related code to the community.