Today, creators of data-hungry deep neural networks (DNNs) scour the Internet for training fodder, leaving users with little control over, or knowledge of, when their data is appropriated for model training. To empower users to counteract unwanted data use, we design, implement, and evaluate a practical system that enables users to detect if their data was used to train a DNN model. We show how users can create special data points we call isotopes, which introduce "spurious features" into DNNs during training. With only query access to a trained model, no knowledge of the model training process, and no control over data labels, a user can apply statistical hypothesis testing to detect whether a model has learned the spurious features associated with their isotopes by training on the user's data. This effectively turns DNNs' vulnerability to memorization and spurious correlations into a tool for data provenance. Our results confirm efficacy in multiple settings, detecting and distinguishing between hundreds of isotopes with high accuracy. We further show that our system works on public ML-as-a-service platforms and on models trained on larger datasets such as ImageNet, can use physical objects instead of digital marks, and remains generally robust against several adaptive countermeasures.
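To make the detection step concrete, the sketch below shows one way the statistical hypothesis test could look, assuming query access to the model returns per-class probabilities. The names `query_model`, `apply_mark`, `probe_images`, and `target_label` are hypothetical placeholders for illustration, not the paper's actual interface, and the one-sided t-test stands in for whatever test statistic the full system uses.

```python
# Minimal sketch of isotope detection via hypothesis testing (illustrative only).
# Assumes query access to the model returns a probability per class.
import numpy as np
from scipy import stats

def detect_isotope(query_model, probe_images, apply_mark, target_label, alpha=0.05):
    """Test whether the model assigns significantly higher probability to
    target_label when the user's isotope mark is present in the input."""
    # Model confidence in the target label on clean probe images (baseline).
    p_clean = np.array([query_model(x)[target_label] for x in probe_images])
    # Model confidence on the same images with the isotope mark applied.
    p_marked = np.array([query_model(apply_mark(x))[target_label] for x in probe_images])
    # One-sided two-sample t-test: the mark should raise confidence only if
    # the model learned the spurious feature from the user's isotope data.
    _, p_value = stats.ttest_ind(p_marked, p_clean, alternative="greater")
    return p_value < alpha, p_value
```

Under this framing, a rejected null hypothesis (small p-value) suggests the model was likely trained on the user's isotope-marked data.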