This paper describes a corpus annotation process to support the identification of hate speech and offensive language in social media. In addition, we provide the first robust corpus of this kind for the Brazilian Portuguese language. The corpus was collected from the Instagram pages of political personalities and manually annotated. It comprises 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive language), the level of offense (highly offensive, moderately offensive, and slightly offensive messages), and the identification of the target of the discriminatory content (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). Each comment was annotated by three different annotators, who achieved high inter-annotator agreement. The proposed annotation process is also language and domain independent.
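As an illustration of the three annotation layers, the following minimal Python sketch shows one way a single annotator's labels for a comment could be represented, and how agreement among three annotators could be measured. The record layout, the rule that non-offensive comments carry no level or target, the majority-vote aggregation, and the choice of Fleiss' kappa as the agreement measure are all assumptions made for the example; the abstract does not specify them.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OffenseLevel(Enum):
    SLIGHT = "slightly offensive"
    MODERATE = "moderately offensive"
    HIGH = "highly offensive"


# Target categories as listed in the abstract.
TARGETS = frozenset({
    "xenophobia", "racism", "homophobia", "sexism", "religious intolerance",
    "partyism", "apology for the dictatorship", "antisemitism", "fatphobia",
})


@dataclass(frozen=True)
class Annotation:
    """One annotator's labels for one comment (hypothetical layout)."""
    offensive: bool                        # layer 1: binary classification
    level: Optional[OffenseLevel] = None   # layer 2: level of offense
    targets: frozenset = frozenset()       # layer 3: zero or more targets

    def __post_init__(self):
        # Assumption: layers 2 and 3 apply only to offensive comments.
        if not self.offensive and (self.level is not None or self.targets):
            raise ValueError("non-offensive comments carry no level or targets")
        if not self.targets <= TARGETS:
            raise ValueError("unknown target category")


def majority_offensive(annotations):
    """Binary label chosen by most annotators (assumed aggregation rule)."""
    votes = Counter(a.offensive for a in annotations)
    return votes.most_common(1)[0][0]


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels from the
    same number of raters (at least two)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}
    # Observed agreement, averaged over items.
    p_bar = sum(
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items
    # Agreement expected by chance from the overall label proportions.
    p_e = sum(
        (sum(c[k] for c in counts) / (n_items * n_raters)) ** 2
        for k in categories
    )
    if p_e == 1.0:  # every rating uses a single category
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)
```

For example, three `Annotation` objects for the same comment can be passed to `majority_offensive` to obtain a single binary label, and `fleiss_kappa` can be applied to the per-comment lists of `offensive` values to summarize agreement on the first layer.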