Online texts with toxic content are a threat on social media and can lead to cyber harassment. Although many platforms have deployed countermeasures, such as machine-learning-based hate-speech detection systems, to mitigate this threat, publishers of toxic content can still evade these systems by modifying the spelling of toxic words. Such modified words are known as human-written text perturbations. Many research works have developed techniques for generating adversarial samples that help machine learning models learn to recognize such perturbations. However, a gap remains between machine-generated and human-written perturbations. In this paper, we introduce a benchmark test set of human-written perturbations collected online for evaluating toxic speech detection models. We also recruited a group of workers to assess the quality of this test set and dropped low-quality samples. In addition, to check whether these perturbations can be normalized back to their clean versions, we applied spell-corrector algorithms to the dataset. Finally, we evaluated state-of-the-art language models, such as BERT and RoBERTa, and black-box APIs, such as the Perspective API, on this data to demonstrate that adversarial attacks with real human-written perturbations remain effective.
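As a concrete illustration of the evaluation pipeline the abstract describes, the sketch below scores a clean toxic sentence, a human-style perturbed variant, and a spell-corrected normalization of it with an off-the-shelf toxicity classifier. It is a minimal sketch, assuming the HuggingFace `transformers` pipeline with the publicly available `unitary/toxic-bert` checkpoint and the `pyspellchecker` package as stand-ins; the abstract does not specify which detection models or spell correctors were used, and the example sentences are hypothetical.

```python
# Minimal sketch: compare toxicity scores for a clean sentence, a
# human-style spelling perturbation, and its spell-corrected version.
# Assumptions: `unitary/toxic-bert` and `pyspellchecker` stand in for
# the (unspecified) classifier and spell corrector from the paper.
from spellchecker import SpellChecker      # pip install pyspellchecker
from transformers import pipeline          # pip install transformers

classifier = pipeline("text-classification", model="unitary/toxic-bert")
spell = SpellChecker()

def normalize(text: str) -> str:
    """Replace each token with its most likely spelling correction, if any."""
    return " ".join(spell.correction(tok) or tok for tok in text.split())

clean = "you are an idiot"          # hypothetical toxic sentence
perturbed = "you are an id1ot"      # human-style spelling perturbation

for name, text in [("clean", clean),
                   ("perturbed", perturbed),
                   ("normalized", normalize(perturbed))]:
    result = classifier(text)[0]    # e.g. {'label': 'toxic', 'score': ...}
    print(f"{name:10s} {text!r:26s} -> {result['label']} ({result['score']:.3f})")
```

A drop in the toxicity score on the perturbed variant that is not recovered by normalization would reproduce the evasion effect the abstract reports.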