Deep learning, especially deep neural networks (DNNs), has been widely and successfully adopted in many critical applications due to its high effectiveness and efficiency. The rapid development of DNNs has benefited from the availability of high-quality datasets ($e.g.$, ImageNet), which allow researchers and developers to easily verify the performance of their methods. Currently, almost all released datasets stipulate that they may only be used for academic or educational purposes rather than commercial purposes without permission. However, there is still no effective way to enforce this requirement. In this paper, we formulate the protection of released datasets as verifying whether a dataset was adopted to train a (suspicious) third-party model, where defenders can only query the model and have no information about its parameters or training details. Based on this formulation, we propose to protect released datasets by embedding external patterns via backdoor watermarking for ownership verification. Our method consists of two main parts: dataset watermarking and dataset verification. Specifically, we exploit poison-only backdoor attacks ($e.g.$, BadNets) for dataset watermarking and design a hypothesis-test-guided method for dataset verification. We also provide theoretical analyses of our method. Experiments on multiple benchmark datasets across different tasks verify the effectiveness of our method. The code for reproducing the main experiments is available at \url{https://github.com/THUYimingLi/DVBW}.
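To make the two stages more concrete, the sketch below illustrates one possible instantiation under simplified assumptions: a BadNets-style white-patch trigger for poison-only watermarking, and a one-sided pairwise T-test on the suspicious model's posterior probability of the target class for verification. All function names (`stamp_trigger`, `watermark_dataset`, `verify_by_t_test`, and the `model_predict_proba` callback) are hypothetical illustrations and do not correspond to the authors' released implementation.

```python
# Minimal sketch of backdoor-based dataset watermarking and hypothesis-test-guided
# verification (probability-available setting). Assumptions: images are float arrays
# in [0, 1] with channel-last layout; the suspicious model exposes class posteriors.
import numpy as np
from scipy import stats


def stamp_trigger(image: np.ndarray, patch_size: int = 3) -> np.ndarray:
    """Stamp a white square trigger in the bottom-right corner (BadNets-style)."""
    marked = image.copy()
    marked[-patch_size:, -patch_size:, ...] = 1.0
    return marked


def watermark_dataset(images, labels, target_label, poison_rate=0.1, seed=0):
    """Poison-only watermarking: stamp the trigger on a small random subset of
    samples and relabel them to the target class; all other samples are untouched."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=int(poison_rate * len(images)), replace=False)
    wm_images, wm_labels = images.copy(), labels.copy()
    for i in idx:
        wm_images[i] = stamp_trigger(wm_images[i])
        wm_labels[i] = target_label
    return wm_images, wm_labels


def verify_by_t_test(model_predict_proba, benign_images, target_label,
                     alpha=0.05, tau=0.2):
    """Query the suspicious model on benign inputs and their triggered copies, then
    run a one-sided paired T-test on the target-class posteriors. Rejecting the null
    (triggered posterior not larger than benign posterior by margin tau) suggests
    the model was trained on the watermarked dataset."""
    triggered = np.stack([stamp_trigger(x) for x in benign_images])
    p_benign = model_predict_proba(np.asarray(benign_images))[:, target_label]
    p_trigger = model_predict_proba(triggered)[:, target_label]
    # Test H1: mean(p_trigger - p_benign) > tau  via a shifted paired comparison.
    t_stat, p_value = stats.ttest_rel(p_trigger, p_benign + tau,
                                      alternative="greater")
    return p_value < alpha, p_value
```

In this sketch, the defender only needs black-box query access (`model_predict_proba`); the margin `tau` and significance level `alpha` control how conservative the ownership claim is, and a label-only variant could replace the T-test with a test on predicted labels.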