Many machine learning adversarial attacks find adversarial samples of a victim model ${\mathcal M}$ by following the gradient of some attack objective function, either explicitly or implicitly. To confuse and detect such attacks, we take a proactive approach that modifies those functions with the goal of misleading the attacks to some local minima, or to some designated regions that can be easily picked up by an analyzer. To achieve this goal, we propose adding a large number of artifacts, which we call $attractors$, onto the otherwise smooth function. An attractor is a point in the input space such that samples in its neighborhood have gradients pointing toward it. We observe that decoders of watermarking schemes exhibit properties of attractors, and we give a generic method that injects attractors from a watermark decoder into the victim model ${\mathcal M}$. This principled approach allows us to leverage known watermarking schemes for scalability and robustness, and it provides explainability of the outcomes. Experimental studies show that our method has competitive performance. For instance, for un-targeted attacks on the CIFAR-10 dataset, we reduce the overall attack success rate of DeepFool to 1.9%, whereas the known defenses LID, FS, and MagNet only reduce the rate to 90.8%, 98.5%, and 78.5%, respectively.
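For concreteness, one possible way to formalize the attractor property is sketched below; the symbols $f$ (attack objective), $a$ (attractor), and $\epsilon$ (neighborhood radius) are illustrative and not fixed by the text, and the sketch assumes the attack ascends $f$ (for a descent-based attack the inequality flips sign):
% Illustrative formalization only; f, a, and \epsilon are assumed notation.
\[
  \forall\, x \in B(a,\epsilon)\setminus\{a\}:\qquad
  \big\langle \nabla_x f(x),\; a - x \big\rangle > 0 ,
\]
so a gradient-following attack step taken from any $x$ near $a$ moves $x$ toward $a$.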