Intelligent or generative writing tools rely on large language models (LLMs) that recognize, summarize, translate, and predict content. This position paper probes the copyright interests in open data sets used to train LLMs. We ask: how do LLMs trained on open data sets circumvent the copyright interests in the data they use? We start by defining software copyright and tracing its history, then turn to GitHub Copilot as a modern case study that challenges software copyright. We conclude by outlining the obstacles that generative writing assistants create for copyright and by offering a practical road map for copyright analysis that developers, software law experts, and general users can consider in the context of intelligent LLM-powered writing tools.
Title: Whose Text Is It Anyway? Exploring BigCode, Intellectual Property, and Ethics