Pre-trained large-scale language models such as BERT have gained a lot of attention thanks to their outstanding performance on a wide range of natural language tasks. However, due to their large number of parameters, they are resource-intensive both to deploy and to fine-tune. Researchers have created several methods for distilling language models into smaller ones to increase efficiency, at the cost of a small performance trade-off. In this paper, we create several different distilled versions of the state-of-the-art Dutch RobBERT model and call them RobBERTje. The distilled models differ in their distillation corpus, namely whether the corpus is shuffled and whether subsequent sentences are merged. We found that the performance of models trained on the shuffled versus non-shuffled datasets is similar for most tasks, and that randomly merging subsequent sentences in the corpus creates models that train faster and perform better on tasks with long sequences. Upon comparing distillation architectures, we found that the larger DistilBERT architecture worked significantly better than the Bort hyperparametrization. Interestingly, we also found that the distilled models exhibit less gender-stereotypical bias than their teacher model. Since smaller architectures decrease the time to fine-tune, these models allow for more efficient training and more lightweight deployment for many Dutch downstream language tasks.