Recently, pre-trained transformer-based architectures have proven highly effective at language modeling and understanding, provided they are trained on a sufficiently large corpus. Applications in language generation for Arabic still lag behind other NLP advances, primarily due to the lack of advanced Arabic language generation models. In this paper, we develop the first advanced Arabic language generation model, AraGPT2, trained from scratch on a large Arabic corpus of internet text and news articles. Our largest model, AraGPT2-mega, has 1.46 billion parameters, making it the largest Arabic language model available. The mega model was evaluated and showed success on different tasks, including synthetic news generation and zero-shot question answering. For text generation, our best model achieves a perplexity of 29.8 on held-out Wikipedia articles. A study conducted with human evaluators showed that AraGPT2-mega generates news articles that are difficult to distinguish from articles written by humans. We therefore develop and release an automatic discriminator model that detects model-generated text with 98% accuracy. The models are publicly available, and we hope they will encourage new research directions and applications for Arabic NLP.
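As a minimal sketch of how the reported perplexity metric could be reproduced, assuming the released models are hosted on the Hugging Face Hub under an ID such as "aubmindlab/aragpt2-base" (the model ID and evaluation text are illustrative assumptions, not details taken from this abstract): perplexity is the exponential of the mean negative log-likelihood of the held-out text under the model.

```python
# Hedged sketch: compute perplexity of a causal LM on a held-out passage.
# The model ID below is an assumption for illustration only.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "aubmindlab/aragpt2-base"  # hypothetical Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "..."  # a held-out Arabic Wikipedia article would go here
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # token-level cross-entropy (negative log-likelihood) as `loss`.
    outputs = model(**inputs, labels=inputs["input_ids"])

# Perplexity = exp(mean negative log-likelihood).
print(f"Perplexity: {math.exp(outputs.loss.item()):.1f}")
```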