During the last two decades, we have progressively turned to the Internet and social media to find news, hold conversations, and share opinions. Recently, OpenAI developed a machine learning system called GPT-2 (Generative Pre-trained Transformer 2), which can produce deepfake texts. Given a brief writing prompt, it can generate blocks of text that look as if they were written by humans, facilitating the spread of false or auto-generated text. In line with this progress, and in order to counteract potential dangers, several methods have been proposed for detecting text written by these language models. In this paper, we propose a transfer-learning-based model that detects whether an Arabic sentence was written by a human or automatically generated by a bot. Our dataset is based on tweets from a previous work, which we crawled and extended using the Twitter API. We used GPT2-Small-Arabic to generate fake Arabic sentences. For evaluation, we compared different recurrent neural network (RNN) baseline models based on word embeddings, namely LSTM, BI-LSTM, GRU, and BI-GRU, with a transformer-based model. Our new transfer-learning model obtained an accuracy of up to 98%. To the best of our knowledge, this work is the first study in which ARABERT and GPT2 were combined to detect and classify Arabic auto-generated texts.
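As an illustration of the evaluation setup described in the abstract (human-written versus generated sentences, scored by accuracy), the following is a minimal sketch and not the authors' pipeline. All names (`build_dataset`, `train_test_split`, `accuracy`, the label constants) are hypothetical, the texts are placeholders rather than real tweets, and the classifier is any callable that maps a sentence to a label; in the paper that role is played by the RNN baselines and the transformer-based model.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical label convention: 0 = human-written, 1 = machine-generated.
HUMAN, GENERATED = 0, 1

Example = Tuple[str, int]


def build_dataset(human: List[str], generated: List[str], seed: int = 0) -> List[Example]:
    """Label both sources and shuffle them into one binary classification set."""
    examples = [(text, HUMAN) for text in human]
    examples += [(text, GENERATED) for text in generated]
    random.Random(seed).shuffle(examples)  # fixed seed keeps the split reproducible
    return examples


def train_test_split(examples: List[Example], test_fraction: float = 0.2):
    """Hold out the first `test_fraction` of the shuffled examples for testing."""
    n_test = max(1, int(len(examples) * test_fraction))
    return examples[n_test:], examples[:n_test]


def accuracy(classifier: Callable[[str], int], examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for text, label in examples if classifier(text) == label)
    return correct / len(examples)


if __name__ == "__main__":
    # Placeholder strings stand in for crawled tweets and generated sentences.
    human = [f"human sentence {i}" for i in range(4)]
    generated = [f"generated sentence {i}" for i in range(4)]
    train, test = train_test_split(build_dataset(human, generated), test_fraction=0.25)
    always_human = lambda text: HUMAN  # trivial baseline, for illustration only
    print(len(train), len(test), accuracy(always_human, test))
```

Any of the compared models can be scored on the same held-out split by passing its prediction function as `classifier`, which keeps the comparison between baselines and the transfer-learning model on equal footing.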