Deep neural networks are vulnerable to malicious fine-tuning attacks such as data poisoning and backdoor attacks. Recent research has therefore proposed methods for detecting malicious fine-tuning of neural network models, but these methods usually degrade the performance of the protected model. We thus propose a novel fragile neural network watermark that causes no degradation of model performance. During watermarking, we train a generative model, driven by a specific loss function and a secret key, to generate triggers that are sensitive to fine-tuning of the target classifier. During verification, we query the watermarked classifier with each fragile trigger to obtain its predicted label; malicious fine-tuning is then detected by comparing these labels against the secret key. Experiments on classic datasets and classifiers show that the proposed method effectively detects malicious fine-tuning of a model without degrading its performance.
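The abstract does not specify the loss function, so the following is a minimal sketch of one plausible objective under stated assumptions: the target classifier (frozen during trigger generation) should assign each generated trigger its secret-key label while keeping the prediction margin small, so that even slight weight perturbations flip the label. The function name `fragile_trigger_loss`, the `margin_weight` coefficient, and the top-2 margin formulation are illustrative assumptions, not the authors' actual loss.

```python
import torch
import torch.nn.functional as F

def fragile_trigger_loss(logits: torch.Tensor,
                         key_labels: torch.Tensor,
                         margin_weight: float = 1.0) -> torch.Tensor:
    """Hypothetical loss for fragile-trigger generation (illustrative only).

    `logits` are the frozen target classifier's outputs on the generated
    triggers; `key_labels` are the secret-key labels each trigger should take.
    """
    # Classification term: each trigger should be predicted as its key label,
    # so that an unmodified model reproduces the secret key exactly.
    ce = F.cross_entropy(logits, key_labels)
    # Fragility term: shrink the gap between the top-2 logits, pushing each
    # trigger toward the decision boundary so fine-tuning flips its label.
    top2 = logits.topk(2, dim=1).values
    margin = (top2[:, 0] - top2[:, 1]).mean()
    return ce + margin_weight * margin
```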
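Verification then reduces to a label comparison. Below is a minimal sketch, again assuming a PyTorch classifier; `verify_watermark`, `triggers`, and `secret_key` are hypothetical names for the fragile triggers and the labels recorded at watermarking time.

```python
import torch

def verify_watermark(classifier: torch.nn.Module,
                     triggers: torch.Tensor,
                     secret_key: torch.Tensor) -> bool:
    """Return True if the model appears unmodified (sketch, not the paper's code).

    Because the triggers sit near the decision boundary, any fine-tuning of
    the classifier is expected to flip some of their predicted labels.
    """
    classifier.eval()
    with torch.no_grad():
        predicted = classifier(triggers).argmax(dim=1)
    # Exact agreement with the secret key indicates the weights are intact.
    return bool(torch.equal(predicted, secret_key))
```

In practice one might threshold the mismatch rate rather than require exact equality, to tolerate flips caused by benign numerical differences across deployment environments.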