Transfer Learning Approach for Arabic Offensive Language Detection System -- BERT-Based Model

From arXiv; 2021 4th International Conference on Computer Applications & Information Security (ICCAIS) - Contemporary Computer Technologies and Applications

Developing a system to detect offensive language online is important to the health and security of online users. Studies have shown that cyberhate, online harassment, and other misuses of technology are on the rise, particularly during the global coronavirus pandemic in 2020. According to the latest report by the Anti-Defamation League (ADL), 35% of online users reported online harassment related to their identity-based characteristics, a 3% increase over 2019. Applying advanced techniques from Natural Language Processing (NLP) to support the development of a hate-free online community is a critical task for social justice. Transfer learning enhances classifier performance by allowing knowledge to be transferred from one domain or dataset to others that have not been seen before, which helps the classifier generalize. In our study, we apply the principles of transfer learning across multiple Arabic offensive language datasets to compare the effects on system performance. The study investigates the effects of fine-tuning and training a Bidirectional Encoder Representations from Transformers (BERT) model on each of several Arabic offensive language datasets individually and testing it on each of the other datasets individually. Our experiment starts with a comparison among multiple BERT models to guide the selection of the main model used in the study. We also investigate the effects of concatenating all datasets for fine-tuning and training the BERT model. Our results demonstrate the limited effects of transfer learning on classifier performance, particularly for highly dialectal comments.
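
The evaluation protocol described in the abstract (train on one dataset, test on each of the others, then train on all datasets combined) can be sketched roughly as follows. This is only an illustration under several assumptions, not the authors' code: the token-counting classifier is a toy stand-in for BERT fine-tuning, the F1 score on the offensive class is an assumed metric, the per-dataset train/test splits are assumed, and the names `toy_splits`, `dataset_a`, `dataset_b`, and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, int]            # (comment text, label: 1 = offensive, 0 = not)
Split = Dict[str, List[Example]]     # {"train": [...], "test": [...]} (assumed layout)
Model = Callable[[str], int]         # maps a comment to a predicted label


def train_toy_model(train: List[Example]) -> Model:
    """Toy stand-in for fine-tuning BERT (NOT the paper's model).

    Flags a comment as offensive if it contains any token that appeared
    in more offensive than non-offensive training comments.
    """
    counts: Dict[str, int] = {}
    for text, label in train:
        for tok in set(text.split()):
            counts[tok] = counts.get(tok, 0) + (1 if label == 1 else -1)
    offensive_tokens = {tok for tok, c in counts.items() if c > 0}
    return lambda text: int(any(tok in offensive_tokens for tok in text.split()))


def f1_offensive(model: Model, test: List[Example]) -> float:
    """F1 on the offensive class (an assumed metric, chosen for illustration)."""
    tp = fp = fn = 0
    for text, label in test:
        pred = model(text)
        tp += int(pred == 1 and label == 1)
        fp += int(pred == 1 and label == 0)
        fn += int(pred == 0 and label == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_dataset_scores(splits: Dict[str, Split]) -> Dict[Tuple[str, str], float]:
    """Train on each dataset individually, then test on every dataset's test split."""
    scores: Dict[Tuple[str, str], float] = {}
    for source, source_split in splits.items():
        model = train_toy_model(source_split["train"])
        for target, target_split in splits.items():
            scores[(source, target)] = f1_offensive(model, target_split["test"])
    return scores


def concatenated_scores(splits: Dict[str, Split]) -> Dict[str, float]:
    """Train once on the concatenation of all training splits, test on each dataset."""
    merged = [ex for split in splits.values() for ex in split["train"]]
    model = train_toy_model(merged)
    return {name: f1_offensive(model, split["test"]) for name, split in splits.items()}


# Hypothetical placeholder data: each dataset uses its own "offensive" vocabulary.
toy_splits: Dict[str, Split] = {
    "dataset_a": {
        "train": [("insult_a hello", 1), ("hello friend", 0)],
        "test": [("insult_a again", 1), ("good friend", 0)],
    },
    "dataset_b": {
        "train": [("insult_b hello", 1), ("nice day", 0)],
        "test": [("insult_b there", 1), ("nice friend", 0)],
    },
}

if __name__ == "__main__":
    for (source, target), score in cross_dataset_scores(toy_splits).items():
        print(f"train={source} test={target} F1={score:.2f}")
    for name, score in concatenated_scores(toy_splits).items():
        print(f"train=all test={name} F1={score:.2f}")
```

In a real replication, `train_toy_model` would be replaced by a routine that fine-tunes the chosen BERT model, while the two loops that build the source-by-target score table and the concatenated baseline could stay the same.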
