In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also by the "type" of discursive role that the comment performs with respect to the previous comment. The initial dataset discussed here (and made available as part of the ComMA@ICON shared task) consists of a total of 15,000 annotated comments in four languages - Meitei, Bangla, Hindi, and Indian English - collected from various social media platforms such as YouTube, Facebook, Twitter, and Telegram. As is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with English. The paper gives a detailed description of the tagset used for annotation and of the process of developing a multi-label, fine-grained tagset that can be used for marking comments with aggression and bias of various kinds, including gender bias, religious intolerance (called communal bias in the tagset), class/caste bias, and ethnic/racial bias. We also define and discuss the tags used for marking the different discursive roles performed by the comments, such as attack, defend, etc. Finally, we present a statistical analysis of the dataset as well as the results of our baseline experiments on developing an automatic aggression identification system using the dataset.
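To make the annotation scheme concrete, the following is a minimal sketch of how a single annotated comment could be represented as a record combining the thread context, the aggression hierarchy, the multi-label bias tags, and the discursive role. All field names and label values below are illustrative assumptions, not the actual ComMA@ICON schema.

```python
# Illustrative sketch only: a hypothetical record format for one annotated comment.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AnnotatedComment:
    comment_id: str
    language: str                     # e.g. "Meitei", "Bangla", "Hindi", "Indian English"
    text: str
    parent_id: Optional[str]          # previous comment in the thread, i.e. the conversational context
    aggression_level: str             # top level of the hierarchy, e.g. "overt" / "covert" / "non-aggressive"
    bias_labels: List[str] = field(default_factory=list)  # multi-label: "gender", "communal", "caste/class", "ethnic/racial"
    discursive_role: Optional[str] = None                  # e.g. "attack", "defend"

# Hypothetical example record:
example = AnnotatedComment(
    comment_id="yt_00123",
    language="Hindi",
    text="...",
    parent_id="yt_00122",
    aggression_level="covert",
    bias_labels=["gender"],
    discursive_role="attack",
)
```

A flat record like this also maps directly onto a multi-label classification setup of the kind used in the baseline experiments, with each bias tag treated as an independent binary target.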