Several recent studies have shown that Deep Neural Network (DNN)-based classifiers are vulnerable to model extraction attacks, in which an adversary exploits the target classifier to create a surrogate classifier that imitates the target classifier with respect to some criteria. In this paper, we investigate the hardness degree of samples and demonstrate that the hardness degree histogram of model extraction attack samples is distinguishable from the hardness degree histogram of normal samples, where normal samples are drawn from the target classifier's training data distribution. Since DNN-based classifiers are trained over several epochs, the training process can be viewed as a sequence of subclassifiers, one created at the end of each epoch. We use this sequence of subclassifiers to compute the hardness degree of samples, and we investigate the relation between the hardness degree of samples and the trust that can be placed in the classifier's outputs. We propose the Hardness-Oriented Detection Approach (HODA) to detect the sample sequences of model extraction attacks. The results demonstrate that HODA detects the sample sequences of model extraction attacks with a high success rate after observing only 100 attack samples. We also investigate the hardness degree of adversarial examples and show that their hardness degree histogram is distinct from that of normal samples.
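The abstract leaves the hardness metric and the detection test implicit. The following minimal sketch illustrates one plausible instantiation, which is an assumption on our part rather than the paper's exact definition: the hardness degree of a sample is taken as the first epoch from which all subclassifier checkpoints agree with the final classifier's prediction, and a query sequence is compared against normal samples via a Pearson correlation distance between hardness histograms.

```python
# Sketch only: "hardness degree" is assumed here to be the earliest epoch
# after which every epoch checkpoint predicts the same label as the final
# classifier; the real metric in the paper may differ.
import numpy as np

def hardness_degrees(epoch_preds: np.ndarray) -> np.ndarray:
    """epoch_preds: (num_epochs, num_samples) array of predicted labels,
    one row per subclassifier saved at the end of each training epoch."""
    final = epoch_preds[-1]                      # predictions of the final classifier
    disagree = epoch_preds != final              # (num_epochs, num_samples) mask
    # index of the last epoch whose prediction still differs from the final label
    last_disagree = np.where(
        disagree.any(axis=0),
        disagree.shape[0] - 1 - np.argmax(disagree[::-1], axis=0),
        -1,                                      # never disagreed: easiest samples
    )
    return last_disagree + 1                     # degree 0 == agreed in every epoch

def hardness_histogram(degrees: np.ndarray, num_epochs: int) -> np.ndarray:
    # Normalised histogram of hardness degrees over 0..num_epochs
    hist, _ = np.histogram(degrees, bins=np.arange(num_epochs + 2))
    return hist / max(hist.sum(), 1)

def pearson_distance(h1: np.ndarray, h2: np.ndarray) -> float:
    # 1 - Pearson correlation between two (non-constant) histograms;
    # a large distance flags the query sequence as a likely extraction attack.
    return 1.0 - np.corrcoef(h1, h2)[0, 1]

# Toy usage: 50 epoch checkpoints, a window of 100 queried samples
rng = np.random.default_rng(0)
epoch_preds = rng.integers(0, 10, size=(50, 100))   # stand-in for checkpoint labels
degrees = hardness_degrees(epoch_preds)
query_hist = hardness_histogram(degrees, num_epochs=50)
```

The thresholds, window size (100 samples in the abstract), and the exact histogram distance would be chosen empirically; this sketch only fixes the data flow from epoch checkpoints to a per-sequence detection score.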