In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types on Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth model pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models, by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the proximity of the pre-training data variant to the fine-tuning data matters more than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.