The last few years have witnessed an exponential rise in the propagation of offensive text on social media. Identifying this text with high precision is crucial for the well-being of society. Most existing approaches tend to give high toxicity scores to innocuous statements (e.g., "I am a gay man"). These false positives result from over-generalization on the training data, where specific terms in the statement may have been used in a pejorative sense (e.g., "gay"). Emphasis on such words alone can lead to discrimination against the very classes these systems are designed to protect. In this paper, we address the problem of offensive language detection on Twitter, while also detecting the type and the target of the offence. We propose a novel approach called SyLSTM, which integrates syntactic features, in the form of the dependency parse tree of a sentence, and semantic features, in the form of word embeddings, into a deep learning architecture using a Graph Convolutional Network. Results show that the proposed approach significantly outperforms the state-of-the-art BERT model while using orders of magnitude fewer parameters.
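To make the architectural idea concrete, the following is a minimal sketch of a single graph-convolution step over a sentence's dependency parse, in the spirit of the abstract's description. This is an illustrative assumption, not the paper's actual SyLSTM implementation: the toy parse edges, dimensions, and random embeddings below are all hypothetical stand-ins.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph-convolution step: H' = ReLU(A_hat @ H @ W),
    where A_hat is the symmetrically normalized adjacency with self-loops."""
    a = adj + np.eye(adj.shape[0])                    # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    a_hat = d_inv_sqrt @ a @ d_inv_sqrt               # symmetric normalization
    return np.maximum(0.0, a_hat @ features @ weight)  # ReLU

# Toy sentence: "I am a gay man" -- a hypothetical dependency parse
# (head, dependent) pairs, encoded as an undirected adjacency matrix.
n_tokens, emb_dim, hid_dim = 5, 8, 4
adj = np.zeros((n_tokens, n_tokens))
for head, dep in [(1, 0), (1, 4), (4, 2), (4, 3)]:  # am->I, am->man, man->a, man->gay
    adj[head, dep] = adj[dep, head] = 1.0

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(n_tokens, emb_dim))   # stand-in word embeddings
weight = rng.normal(size=(emb_dim, hid_dim))        # learnable layer weights

# Each token's output vector now mixes its embedding with its syntactic
# neighbours', so "gay" is contextualized by "man" rather than scored alone.
syntax_features = gcn_layer(adj, embeddings, weight)
print(syntax_features.shape)  # one syntax-aware vector per token
```

In a full model along these lines, such syntax-aware token vectors would feed a downstream sequence encoder (e.g., an LSTM) and classifier, which is how the dependency structure can counteract over-reliance on individual trigger words.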