In this paper we present a benchmark dataset generated as part of a project on the automatic identification of misogyny in online content, with a particular focus on memes. The benchmark consists of 800 memes collected from the most popular social media platforms, such as Facebook, Twitter, Instagram and Reddit, and from websites dedicated to collecting and creating memes. To gather misogynistic memes, keywords referring to misogynistic content were used as search criteria, covering different manifestations of hatred against women, such as body shaming, stereotyping, objectification and violence. In parallel, memes with no misogynistic content were manually downloaded from the same web sources. From all the collected memes, three domain experts selected a dataset of 800 memes, evenly balanced between misogynistic and non-misogynistic ones. This dataset was then validated through a crowdsourcing platform, where 60 subjects took part in the labelling process so that three evaluations were collected for each instance. For memes evaluated as misogynistic, two further binary labels, concerning aggressiveness and irony, were collected from both the experts and the crowdsourcing platform. Finally, the text of each meme was manually transcribed. The dataset provided thus comprises the 800 memes, the labels given by the experts, the labels obtained through crowdsourcing validation, and the transcribed texts. These data can be used to approach the automatic detection of misogynistic content on the Web by relying on both textual and visual cues, addressing phenomena that are growing every day, such as cybersexism and technology-facilitated violence.
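As an illustration of how the three crowdsourced evaluations per meme could be combined and compared with the expert label, the following is a minimal sketch. It assumes majority voting as the aggregation rule, which the abstract does not specify; the record layout (`MemeRecord`), its field names, and the toy entries are hypothetical and do not reflect the released file format or any actual dataset values.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MemeRecord:
    """Hypothetical record layout; field names are assumptions."""
    meme_id: str
    transcribed_text: str
    expert_label: int            # 1 = misogynistic, 0 = not misogynistic
    crowd_labels: Sequence[int]  # three binary crowd evaluations


def majority_vote(labels: Sequence[int]) -> int:
    """Return the binary label chosen by more than half of the annotators."""
    if len(labels) % 2 == 0:
        # An odd count guarantees that binary labels cannot tie.
        raise ValueError("need an odd number of binary labels")
    return int(2 * sum(labels) > len(labels))


def expert_crowd_agreement(records: Sequence[MemeRecord]) -> float:
    """Fraction of memes where the crowd majority matches the expert label."""
    if not records:
        raise ValueError("no records")
    matches = sum(
        majority_vote(r.crowd_labels) == r.expert_label for r in records
    )
    return matches / len(records)


# Toy entries, invented purely for illustration (not taken from the dataset).
toy = [
    MemeRecord("m001", "placeholder text a", 1, (1, 1, 0)),
    MemeRecord("m002", "placeholder text b", 0, (0, 0, 0)),
    MemeRecord("m003", "placeholder text c", 1, (0, 0, 1)),
    MemeRecord("m004", "placeholder text d", 0, (0, 1, 0)),
]

if __name__ == "__main__":
    print(f"expert/crowd agreement on toy data: {expert_crowd_agreement(toy):.2f}")
```

With three binary evaluations per instance a majority always exists, which is one practical reason to collect an odd number of judgements. The same functions could be applied separately to the aggressiveness and irony labels, restricted to the memes evaluated as misogynistic.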