GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors

In this paper, we introduce GradEscape, the first gradient-based evader designed to attack AI-generated text (AIGT) detectors. Because text is discrete, the path from the evader's output tokens to the detector's input is non-differentiable; GradEscape overcomes this by constructing weighted embeddings for the detector input. It then updates the evader model's parameters using feedback from victim detectors, achieving a high attack success rate with minimal text modification. To address tokenizer mismatch between the evader and the detector, we introduce a warm-started evader method, which lets GradEscape adapt to detectors built on any language model architecture. Moreover, we employ novel tokenizer inference and model extraction techniques, enabling effective evasion even with query-only access. We evaluate GradEscape on four datasets and three widely used language models, benchmarking it against four state-of-the-art AIGT evaders. Experimental results demonstrate that GradEscape outperforms existing evaders, including an 11B-parameter paraphrase model, in various scenarios while using only 139M parameters. We have successfully applied GradEscape to two real-world commercial AIGT detectors. Our analysis reveals that the primary vulnerability stems from the disparity in text expression styles within the training data. We also propose a potential defense strategy to mitigate the threat of AIGT evaders. We open-source GradEscape to support the development of more robust AIGT detectors.
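The weighted-embedding idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes a toy detector (mean pooling followed by logistic regression) and differentiates with respect to raw logits rather than the parameters of a real evader model. All names (`weighted_embeddings`, `detector_score`, `grad_wrt_logits`) are hypothetical. The point it shows is that replacing a hard token choice with a probability-weighted mix of the detector's embedding rows keeps the detector score differentiable.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the vocabulary axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def weighted_embeddings(logits, emb):
    # logits: (seq_len, vocab) scores from a hypothetical evader.
    # emb:    (vocab, dim) embedding table of the (toy) detector.
    # Each position gets a probability-weighted mix of embedding rows
    # instead of a single row picked by argmax, so no discrete step occurs.
    return softmax(logits) @ emb

def detector_score(x, w, b):
    # Toy stand-in for a detector (an assumption, not a real AIGT model):
    # mean-pool the input embeddings, then apply logistic regression.
    pooled = x.mean(axis=0)
    return 1.0 / (1.0 + np.exp(-(pooled @ w + b)))

def grad_wrt_logits(logits, emb, w, b):
    # Analytic gradient of the detector score with respect to the logits.
    p = softmax(logits)
    s = detector_score(p @ emb, w, b)
    seq_len = logits.shape[0]
    g_x = s * (1.0 - s) * w / seq_len            # d score / d x_t, shape (dim,)
    g_p = np.tile(emb @ g_x, (seq_len, 1))       # d score / d p, shape (seq_len, vocab)
    # Backpropagate through softmax: p * (g - sum(p * g)).
    return p * (g_p - (p * g_p).sum(axis=-1, keepdims=True))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, 6))
    emb = rng.normal(size=(6, 3))
    w, b = rng.normal(size=3), 0.1
    before = detector_score(weighted_embeddings(logits, emb), w, b)
    # One small gradient-descent step that pushes the detector score down.
    stepped = logits - 0.1 * grad_wrt_logits(logits, emb, w, b)
    after = detector_score(weighted_embeddings(stepped, emb), w, b)
    print(before, after)
```

In a full system the gradient would flow further back into the evader's parameters, typically through an automatic differentiation framework rather than a hand-derived formula.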

