Machine-learning-based language models have recently made significant progress, which introduces the danger that they could be used to spread misinformation. To combat this potential danger, several methods have been proposed for detecting text written by these language models. This paper presents two classes of black-box attacks on these detectors: one that randomly replaces characters with homoglyphs, and another that purposefully misspells words according to a simple scheme. The homoglyph and misspelling attacks decrease a popular neural text detector's recall on neural text from 97.44% to 0.26% and 22.68%, respectively. Results also indicate that the attacks are transferable to other neural text detectors.
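To illustrate the homoglyph attack class described above, the following is a minimal sketch in Python. The homoglyph map, the replacement rate, and the function name are illustrative assumptions, not the paper's exact character set or implementation; the idea is only that visually identical Cyrillic code points replace Latin ones, leaving the text unchanged to a human reader while perturbing a detector's input.

```python
import random

# Hypothetical homoglyph map: Cyrillic stand-ins that render identically
# to their Latin counterparts in most fonts. The paper's actual character
# set may differ; this is an illustrative subset.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "c": "\u0441",  # Cyrillic small es
}

def homoglyph_attack(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly replace a fraction of replaceable characters with homoglyphs.

    The output looks unchanged to a human reader, but the underlying
    code points differ, which can disrupt a detector's tokenization
    and learned statistics.
    """
    rng = random.Random(seed)
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch in HOMOGLYPHS and rng.random() < rate:
            chars[i] = HOMOGLYPHS[ch]
    return "".join(chars)

if __name__ == "__main__":
    sample = "language models can generate convincing prose"
    print(homoglyph_attack(sample))
```

Because the attack only needs to query the detector's output, not its gradients or weights, a sketch like this operates entirely in the black-box setting the paper assumes.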