使用基于代码混合指数的协调人损失 (Offense Detection in Dravidian Languages using Code-Mixing Index based Focal Loss)

Over the past decade, we have seen exponential growth in online content fueled by social media platforms. Data generation of this scale comes with the caveat of insurmountable offensive content in it. The complexity of identifying offensive content is exacerbated by the usage of multiple modalities (image, language, etc.), code-mixed language and more. Moreover, even after careful sampling and annotation of offensive content, there will always exist a significant class imbalance between offensive and non-offensive content. In this paper, we introduce a novel Code-Mixing Index (CMI) based focal loss which circumvents two challenges (1) code-mixing in languages (2) class imbalance problem for Dravidian language offense detection. We also replace the conventional dot product-based classifier with the cosine-based classifier which results in a boost in performance. Further, we use multilingual models that help transfer characteristics learnt across languages to work effectively with low resourced languages. It is also important to note that our model handles instances of mixed script (say usage of Latin and Dravidian-Tamil script) as well. To summarize, our model can handle offensive language detection in a low-resource, class imbalanced, multilingual and code-mixed setting.

翻译：在过去的十年中,我们看到社交媒体平台所推动的在线内容的指数增长。这种规模的数据的生成伴随着不可逾越的冒犯性内容的警告。由于使用多种模式(图像、语言等)、代码混合语言等等,识别冒犯性内容的复杂性更加复杂。此外,即使对冒犯性内容进行了仔细的抽样和说明,在攻击性和非攻击性内容之间也总是存在着严重的阶级不平衡。在本文中,我们引入了一个基于代码混合指数的新颖的焦点损失,从而避免了两种挑战:(1)语言的代码混合 (2) 德拉维迪亚语言犯罪侦破的分类不平衡问题。我们还用基于comesine的分类器取代了基于常规的基于产品分类器,这导致绩效的提高。此外,我们使用多种语言模式帮助将不同语言的学习特点转换到与低资源语言的有效工作。同样重要的是,我们的模式处理混合文字的例子(例如拉丁语和德拉维迪安-塔米尔语的文字的使用)以及。总结说,我们的模型可以处理以低资源、多语言、多语言的代码和低资源、多语系的代码中的冒犯性探测。

相关内容

RetinaNet

关注 7

RetinaNet是2018年Facebook AI团队在目标检测领域新的贡献。它的重要作者名单中Ross Girshick与Kaiming He赫然在列。来自Microsoft的Sun Jian团队与现在Facebook的Ross/Kaiming团队在当前视觉目标分类、检测领域有着北乔峰、南慕容一般的独特地位。这两个实验室的文章多是行业里前进方向的提示牌。 RetinaNet只是原来FPN网络与FCN网络的组合应用，因此在目标网络检测框架上它并无特别亮眼创新。文章中最大的创新来自于Focal loss的提出及在单阶段目标检测网络RetinaNet（实质为Resnet + FPN + FCN）的成功应用。Focal loss是一种改进了的交叉熵(cross-entropy, CE)loss，它通过在原有的CE loss上乘了个使易检测目标对模型训练贡献削弱的指数式，从而使得Focal loss成功地解决了在目标检测时，正负样本区域极不平衡而目标检测loss易被大批量负样本所左右的问题。此问题是单阶段目标检测框架（如SSD/Yolo系列）与双阶段目标检测框架（如Faster-RCNN/R-FCN等）accuracy gap的最大原因。在Focal loss提出之前，已有的目标检测网络都是通过像Boot strapping/Hard example mining等方法来解决此问题的。作者通过后续实验成功表明Focal loss可在单阶段目标检测网络中成功使用，并最终能以更快的速率实现与双阶段目标检测网络近似或更优的效果。

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【深度学习表格检测、信息提取和结构化】《Table Detection, Information Extraction and Structuring using Deep Learning》by Vihar Kurama

专知会员服务

38+阅读 · 2020年1月23日

社交网络上议题社群的公共焦虑研究，中国人民大学新闻学院塔娜讲师，第八届全国社会媒体处理大会SMP2019

专知会员服务

15+阅读 · 2019年10月23日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日