Predicting the Type and Target of Offensive Social Media Posts in Marathi

From arXiv. This is a preprint of an article published in the Journal of Intelligent Information Systems, Springer. The final authenticated version is available online at https://link.springer.com/article/10.1007/s13278-022-00906-8

Offensive language is very common on social media, motivating platforms to invest in strategies that make communities safer. These strategies include developing robust machine learning systems capable of recognizing offensive content online. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English and a few other high-resource languages such as French, German, and Spanish. In this paper, we address this gap by tackling offensive language identification in Marathi, a low-resource Indo-Aryan language spoken in India. We introduce the Marathi Offensive Language Dataset v.2.0 (MOLD 2.0) and present multiple experiments on this dataset. MOLD 2.0 is a much larger version of MOLD, with annotation expanded to levels B (type) and C (target) of the popular OLID taxonomy. MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi, thus opening new avenues for research in low-resource Indo-Aryan languages. Finally, we also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID.

