SWSR: A Chinese Dataset and Lexicon for Online Sexism Detection

Online sexism has become a growing concern on social media platforms, as it affects the healthy development of the Internet and can have negative effects on society. While research on sexism detection is growing, most of it focuses on English as the language and on Twitter as the platform. Our objective here is to broaden the scope of this research by considering the Chinese language on Sina Weibo. We propose the first Chinese sexism dataset, the Sina Weibo Sexism Review (SWSR) dataset, as well as a large Chinese lexicon, SexHateLex, made of abusive and gender-related terms. We introduce our data collection and annotation process, and provide an exploratory analysis of the dataset characteristics to validate its quality and to show how sexism is manifested in Chinese. The SWSR dataset provides labels at different levels of granularity, including (i) sexism or non-sexism, (ii) sexism category and (iii) target type, which can be exploited, among other uses, for building computational methods to identify and investigate finer-grained gender-related abusive language. We conduct experiments on the three sexism classification tasks using state-of-the-art machine learning models. Our results show competitive performance, providing a benchmark for sexism detection in the Chinese language, as well as an error analysis highlighting open challenges that need more research in Chinese NLP. The SWSR dataset and SexHateLex lexicon are publicly available.

