Recently, pre-trained language models have advanced the field of natural language processing (NLP). The introduction of Bidirectional Encoder Representations from Transformers (BERT) and its optimized version RoBERTa has had a significant impact and increased the relevance of pre-trained models. Research in this field initially focused on English data, followed by models trained on multilingual text corpora. However, current research shows that multilingual models are inferior to monolingual models. To date, no German single-language RoBERTa model has been published; we introduce such a model (GottBERT) in this work. The German portion of the OSCAR data set was used as the text corpus. In an evaluation, we compare GottBERT's performance with existing German single-language BERT models and two multilingual models on the two Named Entity Recognition (NER) tasks CoNLL 2003 and GermEval 2014, as well as on the text classification tasks GermEval 2018 (fine and coarse) and GNAD. GottBERT was pre-trained with fairseq following the original RoBERTa pre-training procedure. All downstream tasks were trained using hyperparameter presets taken from the benchmark of German BERT, and the experiments were set up using FARM. Performance was measured by the $F_{1}$ score. GottBERT was successfully pre-trained on a 256-core TPU pod using the RoBERTa BASE architecture. Even without extensive hyperparameter optimization, GottBERT already outperforms all other tested German and multilingual models on all NER tasks and one text classification task. In order to support the German NLP field, we publish GottBERT under the AGPLv3 license.
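For reference, the $F_{1}$ score used throughout the evaluation is the harmonic mean of precision $P$ and recall $R$:
\[
F_{1} = \frac{2 \cdot P \cdot R}{P + R}
\]
For the NER tasks, $F_{1}$ is conventionally computed at the entity level following the respective shared-task evaluation schemes; the exact averaging details are those of the corresponding benchmarks, not additional choices introduced here.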