Online hate speech detection has grown in importance with the volume of online content, yet resources for languages other than English remain extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection designed to handle Korean language patterns. The dataset consists of 109k utterances from news comments, each annotated with 1 to 4 labels, a multi-label scheme that accommodates subjectivity and intersectionality. We report strong baseline experiments on K-MHaS using Korean BERT-based language models, evaluated with six different metrics. KR-BERT with a sub-character tokenizer outperforms the other models, recognizing decomposed characters in each hate speech class.
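To make "decomposed characters" concrete, the sketch below splits precomposed Hangul syllables into their constituent jamo using the standard Unicode arithmetic. It is only an illustration of the sub-character idea, not the KR-BERT tokenizer itself; the function names `decompose_syllable` and `to_subcharacters` are hypothetical, and a real tokenizer would go on to apply a subword vocabulary to the decomposed sequence.

```python
import unicodedata

# Standard Unicode constants for precomposed Hangul syllables (U+AC00..U+D7A3).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables in total


def decompose_syllable(ch: str) -> str:
    """Split one Hangul syllable into conjoining jamo; pass other characters through."""
    index = ord(ch) - S_BASE
    if not 0 <= index < S_COUNT:
        return ch  # not a precomposed Hangul syllable
    lead = chr(L_BASE + index // N_COUNT)
    vowel = chr(V_BASE + (index % N_COUNT) // T_COUNT)
    tail_index = index % T_COUNT  # 0 means the syllable has no final consonant
    tail = chr(T_BASE + tail_index) if tail_index else ""
    return lead + vowel + tail


def to_subcharacters(text: str) -> str:
    """Decompose every Hangul syllable in `text` into its jamo sequence."""
    return "".join(decompose_syllable(ch) for ch in text)


# The arithmetic matches Unicode canonical decomposition (NFD) for Hangul.
sample = "한국어 BERT"
assert to_subcharacters(sample) == unicodedata.normalize("NFD", sample)
```

Working at the jamo level lets a model share parameters across syllables that differ by a single consonant or vowel, which is one plausible reason a sub-character vocabulary helps on noisy comment text.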