Adaptive Testing for Segmenting Watermarked Texts From Language Models

The rapid adoption of large language models (LLMs), such as GPT-4 and Claude 3.5, underscores the need to distinguish LLM-generated text from human-written content in order to limit misinformation and misuse in education. One promising approach is watermarking, which embeds subtle statistical signals into LLM-generated text to enable reliable identification. In this paper, we first generalize the likelihood-based LLM detection method of a previous study by introducing a flexible weighted formulation, and we further adapt this approach to the inverse transform sampling method. Moving beyond watermark detection, we extend this adaptive detection strategy to the more challenging problem of segmenting a given text into watermarked and non-watermarked substrings. In contrast to the approach in a previous study, which relies on accurate estimates of next-token probabilities that are highly sensitive to prompt estimation, our proposed framework removes the need for precise prompt estimation. Extensive numerical experiments demonstrate that the proposed methodology is both effective and robust in segmenting texts that contain a mixture of watermarked and non-watermarked content.
