In this paper, we introduce GradEscape, the first gradient-based evader designed to attack AI-generated text (AIGT) detectors. GradEscape overcomes the non-differentiable computation problem caused by the discrete nature of text by introducing a novel approach that constructs weighted embeddings for the detector input. It then updates the evader model parameters using feedback from victim detectors, achieving high attack success with minimal text modification. To address tokenizer mismatch between the evader and the detector, we introduce a warm-started evader method, enabling GradEscape to adapt to detectors built on any language model architecture. Moreover, we employ novel tokenizer inference and model extraction techniques, which enable effective evasion even under query-only access. We evaluate GradEscape on four datasets and three widely used language models, benchmarking it against four state-of-the-art AIGT evaders. Experimental results demonstrate that GradEscape outperforms existing evaders in various scenarios, including one that uses an 11B-parameter paraphrase model, while itself using only 139M parameters. We have also successfully applied GradEscape to two real-world commercial AIGT detectors. Our analysis reveals that the primary vulnerability stems from the disparity in text expression styles within the training data. We also propose a potential defense strategy to mitigate the threat of AIGT evaders. We open-source GradEscape to support the development of more robust AIGT detectors.
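To illustrate why weighted embeddings restore differentiability, the following is a minimal sketch and not the GradEscape implementation. It assumes that the weights are the evader's output token probabilities and that they are applied to the rows of the detector's embedding matrix; the abstract does not specify either detail. The names `weighted_embedding`, `toy_detector_score`, and `score_grad_wrt_logits`, the toy embedding matrix `E`, and the linear "detector" are all hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_embedding(probs, embedding_matrix):
    # Expected embedding under a token distribution: sum_v probs[v] * E[v].
    # Assumption: probs come from the evader, E belongs to the detector.
    dim = len(embedding_matrix[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_matrix))
            for d in range(dim)]

def toy_detector_score(embedding, weights):
    # Hypothetical stand-in for a detector: a linear score on one embedding.
    return sum(e * w for e, w in zip(embedding, weights))

def score_grad_wrt_logits(logits, embedding_matrix, weights):
    # Analytic gradient of the toy score with respect to the evader logits,
    # obtained from the softmax Jacobian: p_k * (s_k - sum_v p_v * s_v).
    probs = softmax(logits)
    per_token = [toy_detector_score(row, weights) for row in embedding_matrix]
    expected = sum(p * s for p, s in zip(probs, per_token))
    return [p * (s - expected) for p, s in zip(probs, per_token)]

# Toy setup: a vocabulary of 3 tokens with 2-dimensional embeddings.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, -1.0]
logits = [2.0, 0.5, -1.0]

soft_input = weighted_embedding(softmax(logits), E)  # differentiable input
grad = score_grad_wrt_logits(logits, E, w)           # feedback for the evader
```

With a one-hot distribution the weighted embedding reduces to an ordinary embedding lookup, so the discrete case is recovered as a limit. An argmax over tokens, by contrast, would give no usable gradient with respect to the logits, which is the problem the weighted form avoids in this sketch.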