Most prior state-of-the-art adversarial detection methods assume that the underlying vulnerable model is accessible, i.e., that the model can be trained or its outputs are visible. This assumption is often impractical because of factors such as model encryption and model information leakage. In this work, we propose a model-independent adversarial detection method that uses a simple energy function to distinguish between adversarial and natural inputs. We train a standalone detector, independent of the underlying model, with sequential layer-wise training that increases the energy separation between natural and adversarial inputs. We then perform adversarial detection based on the resulting energy distributions. Our method achieves state-of-the-art detection performance (ROC-AUC > 0.9) across a wide range of gradient-, score-, and decision-based adversarial attacks on the CIFAR10, CIFAR100, and TinyImagenet datasets. Compared to prior approaches, our method requires ~10-100x fewer operations and parameters for adversarial detection. Further, we show that our detection method is transferable across different datasets and adversarial attacks. For reproducibility, we provide code in the supplementary material.
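
To make the idea of energy distribution-based detection concrete, the following is a minimal sketch and not the implementation from the supplementary material. It assumes the energy is the negative log-sum-exp of a detector's output logits (a common choice; the abstract only specifies "a simple energy function"), and it assumes adversarial inputs receive higher energy than natural ones. The logits are synthetic, and the names `energy`, `roc_auc`, and `detect` are hypothetical.

```python
import math
import random


def energy(logits, temperature=1.0):
    """Energy score E(x) = -T * log(sum_i exp(logit_i / T)).

    ASSUMPTION: this log-sum-exp form is one common energy function;
    the abstract does not state which form the method uses.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def roc_auc(negative_scores, positive_scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive scores higher; ties count as half a win."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def detect(logits, threshold):
    """Flag an input as adversarial when its energy exceeds the threshold.

    ASSUMPTION: adversarial inputs sit at higher energy than natural ones.
    """
    return energy(logits) > threshold


# Synthetic stand-ins for detector outputs (NOT data from the paper):
# natural inputs get larger logits, hence lower energy.
rng = random.Random(0)
natural_logits = [[rng.gauss(4.0, 1.0) for _ in range(10)] for _ in range(200)]
adversarial_logits = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(200)]

natural_energy = [energy(z) for z in natural_logits]
adversarial_energy = [energy(z) for z in adversarial_logits]

# Treat "adversarial" as the positive class and energy as the score.
auc = roc_auc(natural_energy, adversarial_energy)
```

The pairwise AUC here is quadratic in the number of samples, which is fine for a sketch; a rank-based computation would be preferable at scale. Because the score depends only on the detector's own outputs, nothing in this sketch requires access to the protected model.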