Deep neural networks (DNNs) have demonstrated their superiority in practice. Arguably, their rapid development has benefited largely from high-quality (open-sourced) datasets, with which researchers and developers can easily evaluate and improve their learning methods. Since data collection is usually time-consuming or even expensive, how to protect dataset copyrights is of great significance and worth further exploration. In this paper, we revisit dataset ownership verification. We find that existing verification methods introduce new security risks into DNNs trained on the protected dataset, due to the targeted nature of poison-only backdoor watermarks. To alleviate this problem, we explore the untargeted backdoor watermarking scheme, where the abnormal model behaviors are not deterministic. Specifically, we introduce two dispersibilities and prove their correlation, based on which we design the untargeted backdoor watermark under both poisoned-label and clean-label settings. We also discuss how to use the proposed untargeted backdoor watermark for dataset ownership verification. Experiments on benchmark datasets verify the effectiveness of our methods and their resistance to existing backdoor defenses. Our codes are available at \url{https://github.com/THUYimingLi/Untargeted_Backdoor_Watermark}.
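As a rough illustration of the poisoned-label variant described above, the following minimal Python sketch stamps a toy trigger pattern on a small fraction of training samples and relabels each of them to a random incorrect class, so that the induced misbehavior has no fixed target. The trigger patch, the sampling policy, and all names here are illustrative assumptions rather than the paper's exact construction; see the linked repository for the authors' implementation.

\begin{verbatim}
import random
import numpy as np

def apply_trigger(image, patch_size=3):
    # Toy trigger: stamp a small white square in the bottom-right
    # corner of an HxWxC image with pixel values in [0, 1].
    image = image.copy()
    image[-patch_size:, -patch_size:] = 1.0
    return image

def watermark_untargeted(dataset, num_classes, poison_rate=0.1):
    # Poisoned-label untargeted watermark (illustrative sketch):
    # a random subset of samples gets the trigger plus a RANDOM
    # wrong label, so triggered inputs are misclassified
    # non-deterministically, unlike a targeted watermark that
    # maps all triggered inputs to one fixed class.
    n_poison = int(len(dataset) * poison_rate)
    poison_idx = set(random.sample(range(len(dataset)), n_poison))
    watermarked = []
    for i, (image, label) in enumerate(dataset):
        if i in poison_idx:
            image = apply_trigger(image)
            label = random.choice(
                [c for c in range(num_classes) if c != label])
        watermarked.append((image, label))
    return watermarked
\end{verbatim}

Compared with a targeted watermark, which would assign every triggered sample the same attacker-chosen label, the random relabeling above yields abnormal behavior that cannot be exploited as a deterministic backdoor, which is the security benefit the abstract emphasizes.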