Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. In the United States and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. However, there is a caveat: If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. In this work, we emphasize that fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use. First, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We review relevant U.S. case law, drawing parallels to existing and potential applications for generating text, source code, and visual art. Experiments confirm that popular foundation models can generate content considerably similar to copyrighted material. Second, we discuss technical mitigations that can help foundation models stay in line with fair use. We argue that more research is needed to align mitigation strategies with the current state of the law. Lastly, we suggest that the law and technical mitigations should co-evolve. For example, coupled with other policy mechanisms, the law could more explicitly consider safe harbors when strong technical tools are used to mitigate infringement harms. This co-evolution may help strike a balance between intellectual property and innovation, which speaks to the original goal of fair use. But we emphasize that the strategies we describe here are not a panacea and more work is needed to develop policies that address the potential harms of foundation models.