Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of "green" tokens before a word is generated, and then softly promoting use of green tokens during sampling. We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.
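To make the mechanism concrete, below is a minimal sketch in Python/NumPy of the green-token scheme the abstract describes: a pseudorandom vocabulary partition recomputed before each token, a soft logit bias during sampling, and a z-test with a one-sided p-value for detection. The specific values of `GAMMA` and `DELTA`, and the choice to seed the partition with the previous token id, are illustrative assumptions, not details fixed by this section.

```python
import numpy as np
from math import erfc, sqrt

# Hypothetical parameters for illustration; real deployments would tune these.
GAMMA = 0.5   # fraction of the vocabulary placed on the "green" list
DELTA = 2.0   # logit bias that softly promotes green tokens

def green_mask(prev_token: int, vocab_size: int) -> np.ndarray:
    """Pseudorandomly partition the vocabulary into green/red tokens,
    seeded by the previous token so a detector can recompute the split
    without model access (seeding scheme is an assumption)."""
    rng = np.random.default_rng(prev_token)
    mask = np.zeros(vocab_size, dtype=bool)
    mask[rng.permutation(vocab_size)[: int(GAMMA * vocab_size)]] = True
    return mask

def watermarked_sample(logits: np.ndarray, prev_token: int,
                       rng: np.random.Generator) -> int:
    """Add DELTA to the logits of green tokens, then sample the softmax."""
    biased = logits + DELTA * green_mask(prev_token, logits.shape[0])
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(rng.choice(logits.shape[0], p=probs))

def detect(tokens: list[int], vocab_size: int) -> tuple[float, float]:
    """Count how many tokens landed on their green list and run a
    one-proportion z-test; returns (z, one-sided p-value)."""
    hits = sum(green_mask(prev, vocab_size)[tok]
               for prev, tok in zip(tokens[:-1], tokens[1:]))
    n = len(tokens) - 1
    z = (hits - GAMMA * n) / sqrt(n * GAMMA * (1 - GAMMA))
    p_value = 0.5 * erfc(z / sqrt(2.0))  # standard normal upper tail
    return float(z), p_value
```

Under the null hypothesis of unwatermarked text, each token lands on its green list with probability roughly `GAMMA`, so a long run of green-heavy text yields a large z and a small p-value; this is what makes detection possible from a short span of tokens without querying the model.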