Offensive language is pervasive on social media, motivating platforms to invest in strategies that make communities safer. These include developing robust machine learning systems capable of recognizing offensive content online. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English and a few other high-resource languages such as French, German, and Spanish. In this paper, we address this gap by tackling offensive language identification in Marathi, a low-resource Indo-Aryan language spoken in India. We introduce the Marathi Offensive Language Dataset v.2.0 (MOLD 2.0) and present multiple experiments on this dataset. MOLD 2.0 is a much larger version of MOLD, with annotation expanded to levels B (type) and C (target) of the popular OLID taxonomy. MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi, thus opening new avenues for research in low-resource Indo-Aryan languages. Finally, we also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID.