Deep learning, especially deep neural networks (DNNs), has been widely and successfully adopted in many critical applications due to its high effectiveness and efficiency. The rapid development of DNNs has benefited from the availability of high-quality datasets ($e.g.$, ImageNet), which allow researchers and developers to easily verify the performance of their methods. Currently, almost all released datasets stipulate that they may only be used for academic or educational purposes rather than commercial purposes without permission. However, there is still no effective way to enforce this requirement. In this paper, we formulate the protection of released datasets as verifying whether a dataset was adopted to train a (suspicious) third-party model, where defenders can only query the model and have no information about its parameters or training details. Based on this formulation, we propose to protect released datasets by embedding external patterns via backdoor watermarking for ownership verification. Our method consists of two main parts: dataset watermarking and dataset verification. Specifically, we exploit poison-only backdoor attacks ($e.g.$, BadNets) for dataset watermarking and design a hypothesis-test-guided method for dataset verification. We also provide theoretical analyses of our method. Experiments on multiple benchmark datasets across different tasks verify the effectiveness of our method. The code for reproducing the main experiments is available at \url{https://github.com/THUYimingLi/DVBW}.
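To make the two stages more concrete, the sketch below illustrates one possible instantiation under simplified assumptions: a BadNets-style white-patch trigger for poison-only watermarking, and a one-sided pairwise T-test on the suspicious model's posterior probability of the target class for verification. All function names (`stamp_trigger`, `watermark_dataset`, `verify_by_t_test`, and the `model_predict_proba` callback) are hypothetical illustrations and do not correspond to the authors' released implementation.

```python
# Minimal sketch of backdoor-based dataset watermarking and hypothesis-test-guided
# verification (probability-available setting). Assumptions: images are float arrays
# in [0, 1] with channel-last layout; the suspicious model exposes class posteriors.
import numpy as np
from scipy import stats


def stamp_trigger(image: np.ndarray, patch_size: int = 3) -> np.ndarray:
    """Stamp a white square trigger in the bottom-right corner (BadNets-style)."""
    marked = image.copy()
    marked[-patch_size:, -patch_size:, ...] = 1.0
    return marked


def watermark_dataset(images, labels, target_label, poison_rate=0.1, seed=0):
    """Poison-only watermarking: stamp the trigger on a small random subset of
    samples and relabel them to the target class; all other samples are untouched."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=int(poison_rate * len(images)), replace=False)
    wm_images, wm_labels = images.copy(), labels.copy()
    for i in idx:
        wm_images[i] = stamp_trigger(wm_images[i])
        wm_labels[i] = target_label
    return wm_images, wm_labels


def verify_by_t_test(model_predict_proba, benign_images, target_label,
                     alpha=0.05, tau=0.2):
    """Query the suspicious model on benign inputs and their triggered copies, then
    run a one-sided paired T-test on the target-class posteriors. Rejecting the null
    (triggered posterior not larger than benign posterior by margin tau) suggests
    the model was trained on the watermarked dataset."""
    triggered = np.stack([stamp_trigger(x) for x in benign_images])
    p_benign = model_predict_proba(np.asarray(benign_images))[:, target_label]
    p_trigger = model_predict_proba(triggered)[:, target_label]
    # Test H1: mean(p_trigger - p_benign) > tau  via a shifted paired comparison.
    t_stat, p_value = stats.ttest_rel(p_trigger, p_benign + tau,
                                      alternative="greater")
    return p_value < alpha, p_value
```

In this sketch, the defender only needs black-box query access (`model_predict_proba`); the margin `tau` and significance level `alpha` control how conservative the ownership claim is, and a label-only variant could replace the T-test with a test on predicted labels.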