Online hate speech detection has become important with the growth of digital devices, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments and provides multi-label classification with 1 to 4 labels per utterance, which allows it to handle subjectivity and intersectionality. We evaluate strong baselines on K-MHaS. KR-BERT with a sub-character tokenizer outperforms the other baselines, recognising decomposed characters in each hate speech class.
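As a rough illustration of what "decomposed characters" means at the sub-character level, the sketch below splits precomposed Hangul syllables into their conjoining jamo using Unicode NFD normalisation. This is not KR-BERT's tokenizer, whose vocabulary and segmentation rules are not described here; the function name `decompose_hangul` is hypothetical, and the code assumes only Python's standard `unicodedata` module.

```python
import unicodedata


def decompose_hangul(text: str) -> list[str]:
    """Return the characters of `text` after canonical decomposition (NFD).

    Each precomposed Hangul syllable becomes 2 or 3 conjoining jamo:
    a leading consonant, a vowel, and an optional trailing consonant.
    Note that NFD also decomposes other composed characters, such as
    accented Latin letters; plain ASCII is left unchanged.
    """
    return list(unicodedata.normalize("NFD", text))


def recompose(jamo: list[str]) -> str:
    """Inverse of decompose_hangul: join the jamo and recompose with NFC."""
    return unicodedata.normalize("NFC", "".join(jamo))


if __name__ == "__main__":
    # Print the code points of the jamo that make up one syllable.
    for ch in decompose_hangul("한"):
        print(f"U+{ord(ch):04X}")
```

For example, the syllable 한 (U+D55C) decomposes into the three jamo U+1112, U+1161 and U+11AB, so a sub-character vocabulary can share units across syllables that a syllable-level vocabulary would treat as unrelated.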