While transfer learning from large-scale pre-trained language models has become prevalent in Natural Language Processing tasks, running these models in computationally constrained environments remains a challenging, largely unaddressed problem. Several solutions, including knowledge distillation, network quantization, and network pruning, have been proposed previously; however, these approaches focus mostly on the English language, thus widening the gap for low-resource languages. In this work, we introduce three lightweight and fast versions of distilled BERT models for the Romanian language: Distil-BERT-base-ro, Distil-RoBERT-base, and DistilMulti-BERT-base-ro. The first two models result from individually distilling the knowledge of the two base Romanian BERT versions available in the literature, while the last one was obtained by distilling their ensemble. To our knowledge, this is the first attempt to create publicly available distilled BERT models for Romanian; they are thoroughly evaluated on five tasks: part-of-speech tagging, named entity recognition, sentiment analysis, semantic textual similarity, and dialect identification. Our experimental results show that the three distilled models retain most of their teachers' accuracy while being twice as fast on a GPU and roughly 35% smaller. In addition, we further assess how closely the students' predictions match those of their teachers by measuring their label and probability loyalty, together with regression loyalty, a new metric introduced in this work.
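For readers unfamiliar with the general setup, the sketch below illustrates a standard soft-target distillation loss and one simple way to form an ensemble teacher by averaging logits. The loss weighting, temperature, and ensembling strategy shown here are illustrative assumptions, not the exact configuration used to train the three Romanian models.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic soft-target distillation loss (a sketch, not the paper's exact loss):
    a weighted sum of KL divergence against the teacher's temperature-scaled
    distribution and cross-entropy against the hard labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so its gradients stay comparable to the hard-label term.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


def ensemble_teacher_logits(teacher_logit_list):
    """One simple ensembling choice (an assumption here): average the teachers' logits."""
    return torch.stack(teacher_logit_list, dim=0).mean(dim=0)


if __name__ == "__main__":
    # Random tensors stand in for the outputs of the student and the two teacher BERTs.
    batch, num_classes = 8, 3
    student = torch.randn(batch, num_classes, requires_grad=True)
    teachers = [torch.randn(batch, num_classes) for _ in range(2)]
    labels = torch.randint(0, num_classes, (batch,))
    loss = distillation_loss(student, ensemble_teacher_logits(teachers), labels)
    loss.backward()
    print(float(loss))
```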