Training data detection is critical for enforcing copyright and data licensing, as large language models (LLMs) are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by paraphrasing text with an LLM and scoring each paraphrase by its likelihood under a separate scoring model. A paraphrase whose score closely matches that of the original text is selected, so that no distribution shift is introduced. To test whether a suspect model has been trained on the watermarked data, we compare its token probabilities against those of the scoring model. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude between data used for training and data not used for training, a larger gap than any baseline tested. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.
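To make the mechanism concrete, the sketch below illustrates the two halves of the approach as described above: watermark embedding by score-matched paraphrase selection, and detection by comparing likelihoods under the suspect model and the scoring model. All names here (`embed_watermark`, `detection_statistics`, the log-probability callables) are hypothetical illustrations, not SPECTRA's actual API; the paper's precise selection criterion and test statistic may differ.

```python
from typing import Callable, Sequence

# A model is abstracted as a function returning the total token
# log-probability of a text. How these are computed (e.g., summing
# per-token log-probs from an LM) is an implementation detail.
LogProbFn = Callable[[str], float]


def embed_watermark(
    original: str,
    candidates: Sequence[str],   # paraphrases of `original` produced by an LLM
    scorer_logprob: LogProbFn,   # likelihood under the scoring model
) -> str:
    """Select the paraphrase whose score best matches the original's,
    so that releasing the paraphrase introduces no distribution shift."""
    target = scorer_logprob(original)
    return min(candidates, key=lambda c: abs(scorer_logprob(c) - target))


def detection_statistics(
    texts: Sequence[str],
    suspect_logprob: LogProbFn,  # likelihood under the suspect model
    scorer_logprob: LogProbFn,
) -> list[float]:
    """One statistic per watermarked text: how much more likely the suspect
    model finds the text than the scoring model does. If the suspect was
    trained on the watermarked data, these gaps should be systematically
    positive; a one-sided test over them yields p-values like those
    reported in the abstract."""
    return [suspect_logprob(t) - scorer_logprob(t) for t in texts]
```

The score-matching step is the key design choice: because the released paraphrase is, by construction, about as likely under the scoring model as the original text, an adversary cannot distinguish watermarked from unwatermarked data by likelihood alone, while a model trained on it still leaks the watermark through elevated token probabilities.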