In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. These models memorize the content of their training datasets and can emit it, often verbatim. In this work, we discuss the security, privacy, and licensing implications of memorization. We argue that the use of copyleft code to train LLMs poses a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
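To make the memorization claim concrete, the following is a minimal sketch of how one might probe a model for verbatim emission of training data, using the HuggingFace transformers library. The model name, the GPL license-header prompt, and the greedy-decoding setup are illustrative assumptions, not the method or experiments of this work.

```python
# Minimal sketch: probing a causal code LLM for verbatim memorization.
# Assumptions: the model name, prompt, and decoding settings below are
# illustrative placeholders, not the setup used in this work.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; substitute any causal code LLM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# A prefix likely seen many times in scraped source code
# (the opening line of the standard GPL license header).
prefix = "This program is free software: you can redistribute it and/or modify"

inputs = tokenizer(prefix, return_tensors="pt")
# Greedy decoding surfaces the model's most probable continuation;
# an exact match against the known source text signals memorization.
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
continuation = tokenizer.decode(outputs[0], skip_special_tokens=True)

if "GNU General Public License" in continuation:
    print("Verbatim continuation of the known license text detected.")
print(continuation)
```

A positive match on such a probe only shows that this particular string was memorized; systematic extraction studies sample many prefixes drawn from the training corpus and measure how often the model reproduces the original suffix.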