Offensive language detection and prevention are becoming increasingly critical for maintaining healthy social platforms and safely deploying language models. Despite abundant research on toxic and offensive language in NLP, existing studies focus mainly on English, while few involve Chinese due to the scarcity of resources. To facilitate Chinese offensive language detection and model evaluation, we collect COLDataset, a Chinese offensive language dataset containing 37k annotated sentences. With this high-quality dataset, we provide a strong baseline classifier, \textsc{COLDetector}, which achieves 81% accuracy on offensive language detection. Furthermore, we use the proposed \textsc{COLDetector} to study the offensiveness of outputs from popular Chinese language models (CDialGPT and CPM). We find that (1) CPM tends to generate more offensive outputs than CDialGPT, and (2) certain types of prompts, such as anti-bias sentences, trigger offensive outputs more easily. Altogether, our resources and analyses are intended to help detoxify Chinese online communities and evaluate the safety of generative language models. Disclaimer: The paper contains example data that may be considered profane, vulgar, or offensive.
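To make the intended usage concrete, the sketch below shows how a \textsc{COLDetector}-style classifier could be applied to score generated outputs for offensiveness. It is a minimal illustration, not the paper's released code: it assumes a BERT-style binary classifier fine-tuned on COLDataset and loaded via the HuggingFace \texttt{transformers} API; the checkpoint path \texttt{path/to/coldetector}, the helper \texttt{offensive\_prob}, and the convention that label index 1 is the offensive class are all hypothetical placeholders.
\begin{verbatim}
# Minimal sketch: scoring language-model outputs for offensiveness
# with a fine-tuned binary classifier, in the spirit of COLDetector.
# Assumptions (not from the paper): the checkpoint path below is a
# placeholder, and label index 1 denotes the "offensive" class.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "path/to/coldetector"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

def offensive_prob(texts):
    """Return P(offensive) for a batch of Chinese sentences."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    # Softmax over the two classes; column 1 = offensive (assumed).
    return torch.softmax(logits, dim=-1)[:, 1].tolist()

# Example: estimate the offensive rate of one generator's outputs,
# as done when comparing models such as CDialGPT and CPM.
generations = ["...generated reply 1...", "...generated reply 2..."]
probs = offensive_prob(generations)
rate = sum(p > 0.5 for p in probs) / len(probs)
print(f"Offensive rate: {rate:.2%}")
\end{verbatim}
Under this setup, comparing two generators reduces to running each model's outputs through \texttt{offensive\_prob} and comparing the resulting offensive rates, which mirrors the model-evaluation use case described above.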