Recent attacks on Machine Learning (ML) models, such as evasion attacks with adversarial examples and model stealing through extraction attacks, pose several security and privacy threats. Prior work proposes adversarial training to secure models against adversarial examples that can evade a model's classification and deteriorate its performance. However, this protection technique affects the model's decision boundary and its prediction probabilities, and hence may raise model privacy risks. In fact, a malicious user with only query access to the prediction output of a model can extract it and obtain a high-accuracy and high-fidelity surrogate model. To achieve a better extraction, these attacks leverage the prediction probabilities of the victim model. Notably, all prior work on extraction attacks fails to take into consideration changes made to the training process for security purposes. In this paper, we propose a framework to assess extraction attacks on adversarially trained models with vision datasets. To the best of our knowledge, our work is the first to perform such an evaluation. Through an extensive empirical study, we demonstrate that adversarially trained models are more vulnerable to extraction attacks than models obtained under natural training circumstances: extracted models can achieve up to $\times1.2$ higher accuracy and agreement with fewer than $\times0.75$ of the queries. We additionally find that adversarial robustness is transferable through extraction attacks, i.e., Deep Neural Networks (DNNs) extracted from robust models show enhanced accuracy on adversarial examples compared to DNNs extracted from naturally trained (i.e., standard) models.
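To make the threat model concrete, the sketch below illustrates a query-based extraction attack of the kind described above: the adversary queries the victim's prediction probabilities in a black-box fashion and trains a surrogate to match them. This is a minimal illustration, not the paper's implementation; the names `victim`, `surrogate`, and `query_loader` are hypothetical placeholders.

```python
# Minimal sketch of a query-based model extraction attack in PyTorch.
# Assumption: the victim exposes prediction probabilities per query.
import torch
import torch.nn.functional as F

def extract(victim, surrogate, query_loader, epochs=10, lr=1e-3):
    """Train a surrogate by matching the victim's output probabilities."""
    optimizer = torch.optim.Adam(surrogate.parameters(), lr=lr)
    victim.eval()
    for _ in range(epochs):
        for x, _ in query_loader:  # ground-truth labels unused: black-box setting
            with torch.no_grad():
                # Query the victim's prediction API for soft labels
                p_victim = F.softmax(victim(x), dim=1)
            logits = surrogate(x)
            # KL divergence pulls the surrogate's output distribution
            # toward the victim's, yielding a high-fidelity copy
            loss = F.kl_div(F.log_softmax(logits, dim=1), p_victim,
                            reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return surrogate
```

Matching the full probability vector rather than only the top-1 label is what lets the attacker reach high agreement with fewer queries, which is why leaking richer prediction probabilities (as adversarial training does) can increase extraction risk.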