Currently, the most widespread neural network architecture for training language models is the so-called BERT, which has led to improvements in various NLP tasks. In general, the more parameters a BERT model has, the better its results on these tasks. Unfortunately, memory consumption and training duration increase drastically with model size. In this article, we investigate various training techniques for smaller BERT models and evaluate them on five public German NER tasks, two of which are introduced in this article. We combine different methods from other BERT variants such as ALBERT and RoBERTa, as well as relative positional encoding. In addition, we propose two new fine-tuning techniques that lead to better performance: CSE-tagging and a modified form of LCRF. Furthermore, we introduce a new technique called WWA, which reduces BERT memory usage and yields a small increase in performance.
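The abstract mentions fine-tuning NER models with a modified form of LCRF (linear-chain conditional random field). As a point of reference only, the sketch below shows a generic Viterbi decoder for a *standard* linear-chain CRF over per-token emission scores, not the modified variant proposed in the article; the function name, the use of PyTorch, and the toy label set are illustrative assumptions. In the NER setting described above, the emission scores would come from a (small) BERT encoder followed by a linear layer mapping each token representation to one score per label.

```python
# Minimal, generic linear-chain CRF decoding (Viterbi) over per-token emission
# scores. This is a standard baseline sketch, NOT the modified LCRF from the
# article. Emission scores are assumed to be produced elsewhere (e.g. by a
# BERT encoder plus a linear classification layer).
import torch


def viterbi_decode(emissions: torch.Tensor, transitions: torch.Tensor) -> list:
    """Return the highest-scoring label sequence.

    emissions:   (seq_len, num_labels) per-token label scores.
    transitions: (num_labels, num_labels) score of moving from label i to label j.
    """
    seq_len, num_labels = emissions.shape
    # score[j] = best score of any path ending in label j at the current position
    score = emissions[0].clone()
    backpointers = []

    for t in range(1, seq_len):
        # total[i, j] = score of the best path ending in i, then transitioning
        # to j and emitting token t with label j
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)

    # follow the backpointers from the best final label back to the start
    best_last = int(score.argmax())
    best_path = [best_last]
    for best_prev in reversed(backpointers):
        best_last = int(best_prev[best_last])
        best_path.append(best_last)
    best_path.reverse()
    return best_path


if __name__ == "__main__":
    # toy example: 4 tokens, 3 hypothetical labels (e.g. O, B-PER, I-PER)
    torch.manual_seed(0)
    emissions = torch.randn(4, 3)
    transitions = torch.randn(3, 3)
    print(viterbi_decode(emissions, transitions))
```

During fine-tuning, such a CRF head is typically trained by maximizing the log-likelihood of the gold label sequence (forward algorithm), while Viterbi decoding as above is used at inference time.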