The rapid progress of Large Language Models (LLMs) has made them capable of performing astonishingly well on a variety of tasks, including document completion and question answering. The unregulated use of these models, however, can lead to harmful consequences such as plagiarism, fake news generation, and spam. Reliable detection of AI-generated text is therefore critical for ensuring the responsible use of LLMs. Recent works attempt to tackle this problem either by exploiting model signatures present in generated text or by applying watermarking techniques that imprint specific patterns onto the output. In this paper, we show, both empirically and theoretically, that these detectors are not reliable in practical scenarios. Empirically, we show that paraphrasing attacks, in which a lightweight paraphraser is applied on top of the generative text model, can break a whole range of detectors, including those based on watermarking schemes, neural network classifiers, and zero-shot methods. We then provide a theoretical impossibility result indicating that, for a sufficiently good language model, even the best possible detector can perform only marginally better than a random classifier. Finally, we show that even LLMs protected by watermarking schemes are vulnerable to spoofing attacks, in which adversarial humans infer the hidden watermark signatures and add them to their own text so that it is detected as LLM-generated, potentially causing reputational damage to the LLM's developers. We believe these results can open an honest conversation in the community regarding the ethical and reliable use of AI-generated text.
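
The paraphrasing attack described above can be sketched in a few lines: rewrite the suspect text with a light seq2seq paraphraser and compare detector scores before and after. The checkpoint name and the `detector_score` function below are illustrative placeholders (any T5-style paraphraser and any detector, whether a watermark test, a trained classifier, or a zero-shot method, can be substituted); this is a minimal sketch, not the paper's exact experimental setup.

```python
# Sketch of a paraphrasing attack against an AI-text detector.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

PARAPHRASER = "Vamsi/T5_Paraphrase_Paws"  # placeholder: any lightweight paraphraser
tokenizer = AutoTokenizer.from_pretrained(PARAPHRASER)
model = AutoModelForSeq2SeqLM.from_pretrained(PARAPHRASER)

def paraphrase(text: str, max_new_tokens: int = 256) -> str:
    """Rewrite `text` with the paraphraser, sampling to diversify the surface form."""
    inputs = tokenizer("paraphrase: " + text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, do_sample=True, top_p=0.95,
                                max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def detector_score(text: str) -> float:
    """Placeholder: plug in any detector returning P(text is machine-generated)."""
    raise NotImplementedError("substitute a watermark test, trained classifier, etc.")

if __name__ == "__main__":
    ai_text = "<output of a (possibly watermarked) LLM>"
    attacked = paraphrase(ai_text)
    print("detector score before attack:", detector_score(ai_text))
    print("detector score after attack: ", detector_score(attacked))
```

If the paraphraser preserves the meaning while changing token-level statistics, the score after the attack should drop toward the detector's threshold for human text.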
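As a hedged sketch of the kind of bound underlying the impossibility claim (the notation $\mathcal{M}$, $\mathcal{H}$, and the specific total-variation form below are illustrative and should not be read as the paper's exact theorem statement):

```latex
% Writing M and H for the distributions of machine- and human-generated text,
% the performance of any detector D can be bounded in terms of their
% total variation distance:
\[
  \mathrm{AUROC}(D) \;\le\; \frac{1}{2}
    + \mathrm{TV}(\mathcal{M},\mathcal{H})
    - \frac{\mathrm{TV}(\mathcal{M},\mathcal{H})^{2}}{2}.
\]
% As the language model improves and TV(M, H) -> 0, the right-hand side tends
% to 1/2, i.e., the best possible detector approaches a random classifier.
```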