Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of whitelist tokens before a word is generated, and then softly promoting use of whitelist tokens during sampling. We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.
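To make the generation rule above concrete, here is a minimal sketch of the soft whitelist promotion in PyTorch. It assumes the simplest seeding scheme, where a secret key combined with the single preceding token seeds the RNG that partitions the vocabulary; the function name, the `key` value, and the defaults gamma = 0.5 (whitelist fraction) and delta = 2.0 (logit bias) are illustrative choices, not the paper's reference implementation.

```python
import torch

def watermarked_logits(logits: torch.Tensor, prev_token: int,
                       gamma: float = 0.5, delta: float = 2.0,
                       key: int = 15485863) -> torch.Tensor:
    """Softly promote a pseudorandom whitelist before sampling (sketch)."""
    vocab_size = logits.shape[-1]
    gen = torch.Generator()
    gen.manual_seed(key * prev_token)                 # whitelist depends only on the previous token
    perm = torch.randperm(vocab_size, generator=gen)  # secret but reproducible partition
    whitelist = perm[: int(gamma * vocab_size)]       # gamma-fraction of the vocabulary
    out = logits.clone()
    out[..., whitelist] += delta                      # soft promotion, never a hard ban
    return out

# Usage during sampling: bias the logits, then sample as usual.
# probs = torch.softmax(watermarked_logits(logits, prev_token), dim=-1)
# next_token = torch.multinomial(probs, num_samples=1)
```

Because the bias is additive rather than a hard constraint, a token with a dominant logit (low-entropy text) is still emitted even when it falls outside the whitelist, which is what keeps the impact on text quality negligible.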
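The statistical test can likewise be sketched without any access to the model or its API: the detector re-derives each position's whitelist from the preceding token and the shared key, counts whitelist hits, and applies a one-proportion z-test. This is an assumption-laden illustration under the same single-token seeding scheme as above; in particular, `vocab_size=50272` is OPT's vocabulary size, and the null hypothesis is that unwatermarked text lands on the whitelist at rate gamma.

```python
import math
import torch

def detect_watermark(tokens: list[int], gamma: float = 0.5,
                     key: int = 15485863, vocab_size: int = 50272) -> float:
    """Return a one-sided p-value for the null 'this text is unwatermarked'."""
    k = int(gamma * vocab_size)
    hits = 0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        gen = torch.Generator()
        gen.manual_seed(key * prev)                    # identical seeding to generation
        perm = torch.randperm(vocab_size, generator=gen)
        hits += int(cur in perm[:k].tolist())          # was this token on its whitelist?
    T = len(tokens) - 1                                # number of scored positions
    z = (hits - gamma * T) / math.sqrt(T * gamma * (1 - gamma))
    return 0.5 * math.erfc(z / math.sqrt(2))           # upper-tail p-value, P(Z >= z)
```

With gamma = 0.5, an unwatermarked 200-token span hits the whitelist about 100 times; a hit count near 150 gives z of roughly 7, a p-value far below any conventional threshold, which is why the watermark is detectable from a short span of tokens.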