To detect the deployment of large language models for malicious use cases (e.g., fake content creation or academic plagiarism), several approaches have recently been proposed for identifying AI-generated text via watermarks or statistical irregularities. How robust are these detection algorithms to paraphrases of AI-generated text? To stress test these detectors, we first train an 11B parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, optionally leveraging surrounding text (e.g., user-written prompts) as context. DIPPER also exposes scalar knobs that control the amount of lexical diversity and reordering in the paraphrases. Paraphrasing text generated by three large language models (including GPT3.5-davinci-003) with DIPPER successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings, while only classifying 1% of human-written sequences as AI-generated. We will open-source our code, model, and data for future research.
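The retrieval defense described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real semantic encoder, the `RetrievalDetector` class and its method names are hypothetical, and the 0.75 threshold is an arbitrary placeholder that a provider would presumably calibrate to a target false positive rate.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in (assumption) for a semantic encoder: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class RetrievalDetector:
    """Hypothetical detector kept by an API provider over its own outputs."""

    def __init__(self, threshold=0.75):
        # Placeholder threshold (assumption); not a value from the paper.
        self.threshold = threshold
        self.database = []

    def record_generation(self, text):
        """Store the vector of a sequence the API has just generated."""
        self.database.append(embed(text))

    def best_score(self, candidate):
        """Highest similarity between the candidate and any stored generation."""
        query = embed(candidate)
        return max((cosine(query, stored) for stored in self.database), default=0.0)

    def is_ai_generated(self, candidate):
        """Flag the candidate if it matches a stored generation within the threshold."""
        return self.best_score(candidate) >= self.threshold


detector = RetrievalDetector()
detector.record_generation(
    "The quick brown fox jumps over the lazy dog near the river bank."
)
# A reordered variant shares the same vocabulary, so it still matches.
print(detector.is_ai_generated(
    "Near the river bank, the quick brown fox jumps over the lazy dog."
))
# An unrelated human-written sentence shares no tokens and is not flagged.
print(detector.is_ai_generated(
    "Quarterly revenue grew because of strong cloud demand."
))
```

A word-overlap score like this one would miss paraphrases that swap most of the vocabulary, which is why the defense is framed around semantic similarity; at the scale of millions of stored generations, the linear scan would also likely be replaced by an indexed nearest-neighbor search.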