A Large-scale Dataset for Hate Speech Detection on Vietnamese Social Media Texts

In recent years, Vietnam has witnessed massive growth in the number of social network users on platforms such as Facebook, YouTube, Instagram, and TikTok. On social media, hate speech has become a critical problem for these users. To address this problem, we introduce ViHSD, a human-annotated dataset for automatically detecting hate speech on social networks. The dataset contains over 30,000 comments, and each comment has one of three labels: CLEAN, OFFENSIVE, or HATE. In addition, we describe the data creation process used to annotate the dataset and evaluate its quality. Finally, we evaluate the dataset with deep learning models and transformer models.
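To make the three-label scheme concrete, the following is a minimal sketch of how a labeled comment might be represented and how the label distribution could be computed. Only the label names CLEAN, OFFENSIVE, and HATE come from the abstract. The `Comment` record layout, the `label_distribution` helper, and the placeholder rows are assumptions made for illustration and do not reflect the actual ViHSD file format or contents.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    # The three labels named in the abstract.
    CLEAN = "CLEAN"
    OFFENSIVE = "OFFENSIVE"
    HATE = "HATE"


@dataclass(frozen=True)
class Comment:
    # Hypothetical record layout; the real ViHSD column names may differ.
    text: str
    label: Label


def label_distribution(comments):
    """Return the fraction of comments carrying each label."""
    total = len(comments)
    if total == 0:
        return {label: 0.0 for label in Label}
    counts = Counter(c.label for c in comments)
    return {label: counts[label] / total for label in Label}


# Placeholder rows invented for illustration only (not taken from ViHSD).
sample = [
    Comment("placeholder comment 1", Label.CLEAN),
    Comment("placeholder comment 2", Label.CLEAN),
    Comment("placeholder comment 3", Label.OFFENSIVE),
    Comment("placeholder comment 4", Label.HATE),
]
distribution = label_distribution(sample)
```

Checking the label distribution in this way is a common first step before training a classifier, since a heavily skewed distribution usually calls for class-aware metrics rather than plain accuracy.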


Related content

Dataset


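A dataset (also written data set) is a collection of data.
A dataset is usually presented in tabular form. Each column represents a particular variable, and each row corresponds to a given member of the dataset. The row lists the value of each variable for that member, such as the height and weight of an object or the value of a random number. Each value is known as a datum. The dataset may contain data for one or more members, corresponding to the number of rows.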
