Emojis as Anchors to Detect Arabic Offensive Language and Hate Speech

We introduce a generic, language-independent method to collect a large percentage of offensive and hate tweets regardless of their topics or genres. We harness the extralinguistic information embedded in emojis to collect a large number of offensive tweets. We apply the proposed method to Arabic tweets and compare the results with English tweets, analyzing some cultural differences. We observe a consistent use of these emojis to represent offensiveness across different timelines on Twitter. We manually annotate and publicly release the largest Arabic dataset for offensive, fine-grained hate speech, vulgar, and violent content. Furthermore, we benchmark the dataset for detecting offensive language and hate speech using different transformer architectures and perform an in-depth linguistic analysis. To assess generalization capability, we evaluate our models on external datasets: a Twitter dataset collected using a completely different method, and a multi-platform dataset containing comments from Twitter, YouTube, and Facebook. Competitive results on these datasets suggest that the data collected using our method captures universal characteristics of offensive language. Our findings also highlight the common words used in offensive communications, common targets of hate speech, and specific patterns in violent tweets, and they pinpoint common classification errors that stem from the need to understand context, the need to consider culture and background, and the presence of sarcasm, among others.
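The emoji-anchored collection step described in the abstract can be illustrated with a minimal sketch. It assumes a collection of tweet texts is already available as plain strings. The anchor set and the `filter_by_emoji` helper are hypothetical names chosen for illustration; the abstract does not list the actual emojis or describe the collection pipeline, so this should not be read as the authors' implementation.

```python
from typing import Iterable, List

# Hypothetical anchor set, chosen only for illustration (not the paper's list):
# U+1F595 middle finger, U+1F621 pouting face, U+1F92C face with symbols on mouth.
ANCHOR_EMOJIS = frozenset({"\U0001F595", "\U0001F621", "\U0001F92C"})


def filter_by_emoji(tweets: Iterable[str], anchors=ANCHOR_EMOJIS) -> List[str]:
    """Return the tweets that contain at least one anchor emoji.

    The check looks only at code points, so it does not depend on the
    language or topic of the surrounding text.
    """
    return [t for t in tweets if any(ch in anchors for ch in t)]


# Made-up sample texts; real input would come from a tweet collection.
sample = [
    "good morning \u2600",
    "get lost \U0001F595",
    "so angry \U0001F621\U0001F621",
]
candidates = filter_by_emoji(sample)
```

In this sketch the filter only proposes candidates. As the abstract states, the collected tweets are then manually annotated, so the presence of an emoji is treated as a retrieval signal rather than a label.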

