Automated hate speech detection is an important tool in combating the spread of hate speech, particularly on social media. Numerous methods have been developed for the task, including a recent proliferation of deep-learning-based approaches. A variety of datasets have also been developed, exemplifying various manifestations of the hate speech detection problem. We present here a large-scale empirical comparison of deep and shallow hate speech detection methods across the three most commonly used datasets. Our goal is to illuminate progress in the area and to identify strengths and weaknesses in the current state of the art. We focus our analysis on measures of practical performance, including detection accuracy, computational efficiency, capability in using pre-trained models, and domain generalization. In doing so, we aim to provide guidance on the use of hate speech detection in practice, quantify the state of the art, and identify future research directions. Code and dataset are available at https://github.com/jmjmalik22/Hate-Speech-Detection.