Many generative foundation models (GFMs) are trained on publicly available data and rely on public infrastructure, but 1) may degrade the "digital commons" that they depend on, and 2) have no processes in place to return the value they capture to data producers and stakeholders. Existing conceptions of data rights and protection (which focus largely on individually owned data and the associated privacy concerns) and copyright- or licensing-based models offer some instructive priors, but are ill-suited to the issues that may arise from models trained on commons-based data. We outline the risks posed by GFMs and why they are relevant to the digital commons, and propose numerous governance-based solutions: investment in standardized dataset and model disclosure and other forms of transparency about generative models' training and capabilities; consortia-based funding for monitoring, standards, and auditing organizations; requirements or norms for GFM companies to contribute high-quality data to the commons; and structures for shared ownership based on individual or community provision of fine-tuning data.