In this paper, we propose a novel and practical mechanism that enables a service provider to verify whether a suspect model has been stolen from a victim model via model extraction attacks. Our key insight is that the profile of a DNN model's decision boundary can be uniquely characterized by its \textit{Universal Adversarial Perturbations (UAPs)}. UAPs lie in a low-dimensional subspace, and the subspaces of piracy models align more closely with the victim model's subspace than those of non-piracy models. Based on this, we propose a UAP-based fingerprinting method for DNN models and train an encoder via \textit{contrastive learning} that takes fingerprints as input and outputs a similarity score. Extensive studies show that our framework can detect model IP breaches with confidence $> 99.99\%$ using only $20$ fingerprints of the suspect model. It generalizes well across different model architectures and is robust against post-modifications of stolen models.
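To make the verification flow concrete, the following is a minimal PyTorch sketch of the pipeline the abstract describes: a simplified universal-perturbation routine (a sign-gradient aggregation standing in for a DeepFool-style UAP algorithm), an encoder that maps UAP fingerprints to embeddings (which would be trained with a contrastive objective so that victim/piracy pairs land close together), and a cosine-based similarity score. All names here (\texttt{compute\_uap}, \texttt{FingerprintEncoder}, \texttt{similarity\_score}) and hyperparameters are illustrative assumptions, not the paper's actual implementation.
\begin{verbatim}
# Illustrative sketch of UAP-fingerprint verification; not the
# paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def compute_uap(model, loader, eps=0.04, steps=10):
    """Simplified universal perturbation: ascend the cross-entropy
    loss over batches and clip to an L-inf ball of radius eps.
    (A stand-in for the DeepFool-style UAP algorithm; it only
    illustrates the interface.)"""
    delta = torch.zeros(1, 3, 32, 32, requires_grad=True)
    opt = torch.optim.SGD([delta], lr=0.01)
    model.eval()
    for _ in range(steps):
        for x, y in loader:
            # Maximize the loss so the perturbation pushes inputs
            # across the model's decision boundary.
            loss = -F.cross_entropy(model(x + delta), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)
    return delta.detach()

class FingerprintEncoder(nn.Module):
    """Maps a flattened UAP fingerprint to a unit embedding; in the
    full pipeline this would be trained contrastively so piracy
    models' fingerprints embed near the victim's."""
    def __init__(self, in_dim=3 * 32 * 32, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, uap):
        return F.normalize(self.net(uap.flatten(1)), dim=1)

def similarity_score(encoder, victim_uaps, suspect_uaps):
    """Cosine similarity between the mean embeddings of two
    fingerprint sets; a high score flags the suspect model as a
    likely extraction of the victim."""
    with torch.no_grad():
        zv = encoder(victim_uaps).mean(0)
        zs = encoder(suspect_uaps).mean(0)
    return F.cosine_similarity(zv, zs, dim=0).item()
\end{verbatim}
In deployment, the service provider would collect a small set of UAP fingerprints from the suspect model (the abstract reports that $20$ suffice), score them with the trained encoder, and compare the score against a calibrated decision threshold.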