The datasets most widely used for abusive language detection consist of messages, usually tweets, that one or more annotators have manually labeled as abusive or not, with the annotation performed at the message level. In this paper, we investigate what happens when the hateful content of a message is also judged in light of its context, given that messages are often ambiguous and need to be interpreted in the context in which they occur. We first re-annotate part of a widely used English dataset for abusive language detection under two conditions, i.e., with and without context. We then compare the performance of three classification algorithms on the two resulting datasets, arguing that context-aware classification is more challenging but also closer to a real application scenario.