TweetBLM: A Hate Speech Dataset and Analysis of Black Lives Matter-related Microblogs on Twitter

In the past few years, toxic and hateful content has risen significantly across social media platforms. Recently, the Black Lives Matter movement came to prominence, triggering an avalanche of user-generated responses on the internet. In this paper, we propose TweetBLM, a hate speech dataset of tweets related to Black Lives Matter. The dataset comprises 9165 manually annotated tweets that target the Black Lives Matter movement. We annotated the tweets into two classes, HATE and NONHATE, based on their content related to the racism that erupted around the movement for the Black community. We also generate statistical insights on the dataset and systematically analyze several machine learning models (Random Forest, CNN, LSTM, BiLSTM, fastText, BERT-base, and BERT-large) on the classification task. Through this work, we aim to contribute to the research community's substantial efforts to identify and mitigate hate speech on the internet. The dataset is publicly available.

