As large language models (LLMs) are deployed in sensitive domains, verifying their computational provenance without disclosing their training data is a significant challenge, particularly in regulated sectors such as healthcare, where dataset use is strictly controlled. Existing approaches either incur the substantial computational cost of verifying the entire training process or leak information to the verifier that it is not authorized to see. We therefore introduce ZKPROV, a novel cryptographic framework that allows users to verify that the responses an LLM returns to their prompts were generated by a model trained on datasets certified by the authorities that own them. It further ensures that the dataset content is relevant to the user's query, without revealing sensitive information about the datasets or the model parameters. ZKPROV balances privacy and efficiency by cryptographically binding the training datasets, model parameters, and responses, and by attaching zero-knowledge proofs to the responses generated by the LLM that attest to these claims. Our experimental results demonstrate that proof generation and verification scale sublinearly, with end-to-end overhead under 3.3 seconds for models of up to 8B parameters, making the approach practical for real-world deployment. We also provide formal security guarantees, proving that our approach preserves dataset confidentiality while ensuring trustworthy dataset provenance.
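To make the binding idea concrete, the following is a minimal, illustrative sketch of how a response can be cryptographically bound to hash commitments over a dataset and model parameters. It is not ZKPROV's actual construction (the abstract does not specify the commitment scheme, certification mechanism, or proof system), and all names and values below are hypothetical placeholders.

```python
import hashlib
import os

def commit(payload: bytes, nonce: bytes) -> bytes:
    """Hash commitment: binds to `payload` and hides it while `nonce` stays secret."""
    return hashlib.sha256(nonce + payload).digest()

# Hypothetical stand-ins for the real artifacts.
certified_dataset = b"...serialized certified dataset..."
model_parameters = b"...serialized model weights..."
llm_response = b"...model answer to the user's prompt..."

# The prover commits to the dataset and to the model parameters.
ds_nonce, mp_nonce = os.urandom(32), os.urandom(32)
c_dataset = commit(certified_dataset, ds_nonce)
c_params = commit(model_parameters, mp_nonce)

# The response is bound to both commitments in a single digest. In a scheme
# like ZKPROV the commitment openings would stay private; a zero-knowledge
# proof attached to the response would attest that the tag was formed from
# commitments to a certified dataset and to parameters trained on it.
binding_tag = hashlib.sha256(c_dataset + c_params + llm_response).hexdigest()
print("response binding tag:", binding_tag)
```

Under these assumptions, a verifier who receives the response and its binding tag can check consistency against published commitments without ever seeing the dataset or the weights, which mirrors the privacy/verifiability trade-off the abstract describes.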