Whilst adversarial attack detection has received considerable attention, it remains a fundamentally challenging problem from two perspectives. First, while threat models can be well-defined, attacker strategies may still vary widely within those constraints. Therefore, detection should be treated as an open-set problem, standing in contrast to most current detection approaches. These methods take a closed-set view and train binary detectors, thus biasing detection toward attacks seen during detector training. Second, limited information is available at test time, and it is typically confounded by nuisance factors including the label and underlying content of the image. We address these challenges via a novel strategy based on random subspace analysis. We present a technique that utilizes properties of random projections to characterize the behavior of clean and adversarial examples across a diverse set of subspaces. The self-consistency (or inconsistency) of model activations is leveraged to discern clean from adversarial examples. Performance evaluations demonstrate that our technique ($AUC\in[0.92, 0.98]$) outperforms competing detection strategies ($AUC\in[0.30, 0.79]$), while remaining truly agnostic to the attack strategy (for both targeted and untargeted attacks). It also requires significantly less calibration data (composed only of clean examples) than competing approaches to achieve this performance.
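To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of detection via random subspace analysis: activation vectors are projected into several random subspaces, a decision is made in each subspace, and the agreement of those decisions serves as a self-consistency score. The number of subspaces, the projection dimension, the toy 10-class prototype classifier, and the function names are illustrative assumptions; the paper's actual procedure may differ.

```python
# Hypothetical sketch of random-subspace self-consistency scoring.
# Assumptions (not from the paper): n_subspaces, proj_dim, and a toy
# nearest-prototype classifier standing in for the model's decision rule.
import numpy as np

def random_subspace_scores(activations, n_subspaces=20, proj_dim=64, seed=0):
    """Return a self-consistency score per example (higher = more consistent)."""
    rng = np.random.default_rng(seed)
    n, d = activations.shape
    prototypes = rng.standard_normal((10, d))  # assumed 10-class toy setup
    votes = np.empty((n_subspaces, n), dtype=int)
    for s in range(n_subspaces):
        # Random projection into a proj_dim-dimensional subspace.
        P = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)
        z = activations @ P                     # projected activations
        c = prototypes @ P                      # projected prototypes
        # Decision in this subspace: nearest projected prototype.
        dists = ((z[:, None, :] - c[None, :, :]) ** 2).sum(-1)
        votes[s] = dists.argmin(axis=1)
    # Self-consistency: fraction of subspaces agreeing with the majority vote.
    majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    return (votes == majority[None, :]).mean(axis=0)

# Clean examples are expected to score near 1.0; adversarial examples, whose
# activations behave inconsistently across subspaces, score lower and can be
# flagged with a threshold calibrated on clean data only.
scores = random_subspace_scores(np.random.randn(8, 512))
print(scores)
```

The design mirrors the abstract's premise that calibration uses only clean examples: the detector never needs to see an attack, so it remains agnostic to the attack strategy.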