Generative Artificial Intelligence (GAI) has experienced exponential growth in recent years, partly facilitated by the abundance of large-scale open-source datasets. These datasets are often built using unrestricted and opaque data collection practices. While most of the literature focuses on the development and applications of GAI models, the ethical and legal considerations surrounding the creation of these datasets are often neglected. In addition, as datasets are shared, edited, and further reproduced online, information about their origin, legitimacy, and safety is often lost. To address this gap, we introduce the Compliance Rating Scheme (CRS), a framework designed to evaluate dataset compliance with key transparency, accountability, and security principles. We also release an open-source Python library built around data provenance technology to implement this framework, allowing for seamless integration into existing dataset-processing and AI training pipelines. The library is both reactive and proactive: beyond evaluating the CRS of existing datasets, it also guides the responsible scraping and construction of new ones.
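To make the intended pipeline integration concrete, the minimal sketch below shows how a CRS-style evaluation could be invoked on a dataset's provenance metadata before training. All names (`ComplianceReport`, `evaluate_dataset`, the manifest fields) and the scoring heuristic are illustrative assumptions for exposition only, not the released library's actual API.

```python
# Hypothetical sketch of a CRS-style compliance check inside a dataset pipeline.
# The class/function names and the scoring heuristic are assumptions, not the
# actual interface of the released library.

from dataclasses import dataclass, field


@dataclass
class ComplianceReport:
    """Aggregates per-principle scores into an overall CRS rating."""
    transparency: float
    accountability: float
    security: float
    notes: list[str] = field(default_factory=list)

    @property
    def rating(self) -> str:
        # Naive aggregation for illustration: average the three pillars
        # and bucket the result into letter grades.
        score = (self.transparency + self.accountability + self.security) / 3
        return "A" if score >= 0.8 else "B" if score >= 0.5 else "C"


def evaluate_dataset(provenance_manifest: dict) -> ComplianceReport:
    """Score a dataset from its provenance metadata (toy heuristic)."""
    has_sources = 1.0 if provenance_manifest.get("sources") else 0.0
    has_license = 1.0 if provenance_manifest.get("license") else 0.0
    has_checksums = 1.0 if provenance_manifest.get("checksums") else 0.0
    return ComplianceReport(
        transparency=has_sources,
        accountability=has_license,
        security=has_checksums,
    )


if __name__ == "__main__":
    # Example manifest a responsible scraper might record while building a dataset.
    manifest = {
        "sources": ["https://example.org/corpus"],
        "license": "CC-BY-4.0",
        "checksums": {"shard-000": "sha256:..."},
    }
    report = evaluate_dataset(manifest)
    print(f"CRS rating: {report.rating}")  # fully documented provenance yields "A"
```

In a real pipeline, a check of this kind could gate dataset ingestion (reactive use) or be run incrementally during scraping so that missing provenance is flagged at collection time (proactive use).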