We observe a recent behaviour on social media in which users intentionally remove the consonantal dots from Arabic letters in order to bypass content-classification algorithms. Content classification is typically performed by fine-tuning pre-trained language models, which have recently been adopted by many natural-language-processing applications. In this work we study the effect of applying pre-trained Arabic language models to "undotted" Arabic texts. We suggest several ways of supporting undotted texts with pre-trained models, without any additional training, and measure their performance on two Arabic natural-language-processing downstream tasks. The results are encouraging; on one of the tasks our method attains nearly perfect performance.
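For illustration, the following is a minimal Python sketch of the undotting behaviour described above: each dotted Arabic letter is replaced by its dotless skeleton (rasm) form, using Unicode's dotless letter codepoints where they exist. The mapping is our own illustrative assumption, not the paper's implementation.

```python
# Illustrative sketch of "undotting" (not the paper's implementation):
# map each dotted Arabic letter to its dotless skeleton (rasm) form.
UNDOT = str.maketrans({
    "\u0628": "\u066E",  # beh   -> dotless beh
    "\u062A": "\u066E",  # teh   -> dotless beh
    "\u062B": "\u066E",  # theh  -> dotless beh
    "\u062C": "\u062D",  # jeem  -> hah
    "\u062E": "\u062D",  # khah  -> hah
    "\u0630": "\u062F",  # thal  -> dal
    "\u0632": "\u0631",  # zain  -> reh
    "\u0634": "\u0633",  # sheen -> seen
    "\u0636": "\u0635",  # dad   -> sad
    "\u0638": "\u0637",  # zah   -> tah
    "\u063A": "\u0639",  # ghain -> ain
    "\u0641": "\u06A1",  # feh   -> dotless feh
    "\u0642": "\u066F",  # qaf   -> dotless qaf
    "\u0646": "\u06BA",  # noon  -> noon ghunna (dotless)
    "\u064A": "\u0649",  # yeh   -> alef maksura (dotless yeh)
})

def undot(text: str) -> str:
    """Strip consonantal dots from Arabic text."""
    return text.translate(UNDOT)

# Example: the dotted verb "يكتب" becomes the dotless skeleton "ىكٮٮ".
print(undot("\u064A\u0643\u062A\u0628"))
```

Because several dotted letters collapse onto the same skeleton (e.g. beh, teh, and theh all map to dotless beh), undotting is lossy, which is precisely what makes the resulting text hard for classifiers trained on standard dotted Arabic.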