Over the past decade, online content has grown exponentially, fueled by social media platforms. Content generated at this scale inevitably includes an unmanageably large amount of offensive material. Identifying offensive content is made harder by the use of multiple modalities (image, language, etc.), code-mixed language, and other factors. Moreover, even after careful sampling and annotation, a significant class imbalance between offensive and non-offensive content always remains. In this paper, we introduce a novel Code-Mixing Index (CMI) based focal loss that addresses two challenges in offensive language detection for Dravidian languages: (1) code-mixing across languages and (2) class imbalance. We also replace the conventional dot-product-based classifier with a cosine-based classifier, which improves performance. Further, we use multilingual models, which help transfer characteristics learnt across languages and so work effectively with low-resource languages. Our model also handles mixed-script instances (for example, text that combines Latin and Tamil script). To summarize, our model can handle offensive language detection in a low-resource, class-imbalanced, multilingual and code-mixed setting.
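As a rough illustration of the two components named above, the following NumPy sketch combines a cosine-based classifier with a focal loss whose per-example weight depends on the CMI. The abstract does not give the exact formulation, so several pieces are assumptions: the CMI follows the commonly used token-level definition, 100 × (1 − max language count / number of language-tagged tokens); the focal term is the standard (1 − p_t)^γ form; and the way CMI enters the loss (a multiplicative weight of 1 + CMI/100) is purely hypothetical. The function names, the `scale` parameter, and the `"univ"` tag for language-independent tokens are likewise illustrative.

```python
import numpy as np


def code_mixing_index(lang_tags, neutral="univ"):
    """Token-level CMI in [0, 100]; `neutral` marks language-independent tokens."""
    counted = [t for t in lang_tags if t != neutral]
    if not counted:
        return 0.0
    _, counts = np.unique(counted, return_counts=True)
    # Share of tokens that are NOT in the dominant language.
    return 100.0 * (1.0 - counts.max() / len(counted))


def cosine_logits(features, class_weights, scale=16.0, eps=1e-12):
    """Cosine-based classifier: scaled cosine similarity instead of a raw dot product."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    w = class_weights / (np.linalg.norm(class_weights, axis=1, keepdims=True) + eps)
    return scale * (f @ w.T)


def cmi_focal_loss(logits, labels, cmi, gamma=2.0):
    """Focal loss with a per-example CMI weight (the weighting is an assumption)."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    log_pt = log_probs[np.arange(len(labels)), labels]
    pt = np.exp(log_pt)
    focal = -((1.0 - pt) ** gamma) * log_pt  # down-weights easy examples
    # ASSUMPTION: more heavily code-mixed examples receive a larger weight.
    weights = 1.0 + np.asarray(cmi, dtype=float) / 100.0
    return float((weights * focal).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 8))    # stand-in for multilingual encoder outputs
    classes = rng.normal(size=(3, 8))  # one weight vector per class
    labels = np.array([0, 1, 2, 0])
    cmi = [code_mixing_index(t) for t in (
        ["ta", "ta", "ta"],
        ["ta", "en", "ta", "univ"],
        ["en", "ta"],
        ["univ"],
    )]
    print(cmi_focal_loss(cosine_logits(feats, classes), labels, cmi))
```

Because the features and class weights are both L2-normalised, the logits depend only on direction, not magnitude, which is one common motivation for cosine classifiers under class imbalance. With `gamma=0` and all CMI values at zero, the loss reduces to ordinary cross-entropy.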