Hope Speech Detection in Under-Resourced Kannada Language

In recent years, numerous methods have been developed to curb the spread of negativity by removing vulgar, offensive, and aggressive comments from social media platforms. However, comparatively little research focuses on embracing positivity and reinforcing supportive and reassuring content in online forums. Consequently, we propose an English-Kannada hope speech dataset, KanHope, and compare several experiments to benchmark it. The dataset consists of 6,176 user-generated comments in code-mixed Kannada, scraped from YouTube and manually annotated as either hope speech or not-hope speech. In addition, we introduce DC-BERT4HOPE, a dual-channel model that uses the English translation of KanHope as additional training input to improve hope speech detection. The approach achieves a weighted F1-score of 0.756, outperforming the other models. KanHope aims to stimulate research on Kannada while more broadly encouraging researchers to take a pragmatic approach towards online content that is encouraging, positive, and supportive.
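The abstract reports a weighted F1-score. As a rough illustration of that metric only, the sketch below computes a support-weighted mean of per-class F1 scores in plain Python. The function name `weighted_f1` and the toy labels are assumptions made for this example; they are not taken from KanHope or from the authors' evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 (hypothetical helper)."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each class contributes in proportion to its share of true labels.
        score += (n / total) * f1
    return score


# Toy labels for illustration only; not drawn from the dataset.
y_true = ["hope", "hope", "not_hope", "not_hope", "not_hope"]
y_pred = ["hope", "not_hope", "not_hope", "not_hope", "hope"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Weighting by support means the majority class dominates the score, which is worth keeping in mind when the two classes are imbalanced.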

