Past literature has shown that language models (LMs) often memorize parts of training instances and reproduce them in natural language generation (NLG) processes. However, it is unclear to what extent LMs "reuse" a training corpus. For instance, models can generate paraphrased sentences that are contextually similar to training samples. In this work, therefore, we study three types of plagiarism (i.e., verbatim, paraphrase, and idea) among GPT-2 generated texts, in comparison to its training data, and further analyze the plagiarism patterns of LMs fine-tuned on domain-specific corpora, which are extensively used in practice. Our results suggest that (1) all three types of plagiarism widely exist in LMs beyond memorization, (2) both the size and the decoding methods of LMs are strongly associated with the degrees of plagiarism they exhibit, and (3) fine-tuned LMs' plagiarism patterns vary based on their corpus similarity and homogeneity. Given that a majority of LMs' training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets in generated texts has ethical implications. These patterns are likely to worsen as both the size of LMs and their training data increase, raising concerns about indiscriminately pursuing larger models with larger training corpora. Plagiarized content can also contain individuals' personal and sensitive information. Overall, these findings cast doubt on the practicality of current LMs in mission-critical writing tasks and call for more discussion of the observed phenomena. Data and source code are available at https://github.com/Brit7777/LM-plagiarism.