Can AI-Generated Text be Reliably Detected?

The rapid progress of Large Language Models (LLMs) has made them capable of performing astonishingly well on various tasks, including document completion and question answering. The unregulated use of these models, however, can lead to malicious consequences such as plagiarism, fake news generation, and spamming. Reliable detection of AI-generated text can therefore be critical to ensuring the responsible use of LLMs. Recent works attempt to tackle this problem either by using certain model signatures present in the generated text or by applying watermarking techniques that imprint specific patterns onto it. In this paper, both empirically and theoretically, we show that these detectors are not reliable in practical scenarios. Empirically, we show that paraphrasing attacks, where a light paraphraser is applied on top of the generative text model, can break a whole range of detectors, including those using watermarking schemes as well as neural network-based detectors and zero-shot classifiers. We then provide a theoretical impossibility result indicating that, for a sufficiently good language model, even the best-possible detector can only perform marginally better than a random classifier. Finally, we show that even LLMs protected by watermarking schemes can be vulnerable to spoofing attacks, in which adversarial humans infer the hidden watermarking signatures and add them to their own text so that it is detected as LLM-generated, potentially causing reputational damage to the LLM developers. We believe these results can open an honest conversation in the community regarding the ethical and reliable use of AI-generated text.
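
To give a feel for the impossibility result, the sketch below evaluates an upper bound on detector AUROC as a function of the total variation distance (TV) between the model's text distribution and the human text distribution. The formula used, AUROC <= 1/2 + TV - TV^2/2, is our recollection of the bound stated in the full paper; it does not appear in the abstract above and should be checked against the paper before being relied on. The function name `auroc_upper_bound` is hypothetical and chosen only for illustration.

```python
def auroc_upper_bound(tv: float) -> float:
    """Upper bound on the AUROC of any detector, given the total variation
    distance `tv` between AI-generated and human text distributions.

    Assumed form (recalled from the full paper, not stated in the abstract):
        AUROC <= 1/2 + tv - tv**2 / 2
    """
    if not 0.0 <= tv <= 1.0:
        raise ValueError("total variation distance must lie in [0, 1]")
    return 0.5 + tv - tv * tv / 2.0


if __name__ == "__main__":
    # As the model's output distribution approaches human text (tv -> 0),
    # the bound approaches 0.5, the AUROC of a random classifier.
    for tv in (0.0, 0.1, 0.5, 1.0):
        print(f"TV = {tv:.1f}  ->  AUROC <= {auroc_upper_bound(tv):.3f}")
```

Under this assumed form, a TV of 0.1 caps the AUROC of any detector at 0.595, which is what "only marginally better than a random classifier" means in concrete terms.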

