In this work, we study poison sample detection for defending against backdoor poisoning attacks on deep neural networks (DNNs). A principled idea underlying prior work on this problem is to exploit the distinguishable behaviors of backdoored models on poison and clean populations in order to separate the two populations and remove the identified poison samples. Many prior detectors build upon a latent separability assumption, which states that backdoored models trained on the poisoned dataset will learn separable latent representations for poison and clean samples. Although such separation behaviors empirically exist for many existing attacks, there is no control over this separability, and the extent of separation varies considerably across poisoning strategies, datasets, and the training configurations of backdoored models. Worse still, recent adaptive poisoning strategies can greatly reduce these "distinguishable behaviors" and consequently render most prior defenses less effective (or make them fail completely). We point out that these limitations arise directly from the passive reliance on distinguishable behaviors that are not under the defender's control. To mitigate such limitations, we propose the idea of active defense: rather than passively assuming that backdoored models will exhibit certain distinguishable behaviors on poison and clean samples, we actively enforce the trained models to behave differently on these two populations. Specifically, we introduce confusion training as a concrete instance of active defense.
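To make the passive, latent-separability-based paradigm discussed above concrete, the following minimal sketch illustrates how a generic prior-art detector (in the spirit of activation-clustering defenses, not the confusion training approach introduced in this work) clusters per-class latent features and flags the minority cluster as suspected poison. The feature extractor `model.features` and the data-loading setup are assumptions made for the sketch; the key point is that the detector's success hinges entirely on the uncontrolled separability assumption.

```python
# Illustrative sketch of a passive, latent-separability-based detector.
# Assumption: `model.features(x)` returns penultimate-layer activations and
# `dataset` yields (input, label) pairs; both names are hypothetical.
import numpy as np
import torch
from sklearn.cluster import KMeans


@torch.no_grad()
def flag_suspicious_samples(model, dataset, num_classes, device="cpu", batch_size=256):
    """Cluster latent features within each class and flag the minority cluster.

    Works only if poison samples happen to form a separable minority cluster
    in latent space -- the uncontrolled assumption the abstract argues against.
    """
    model.eval().to(device)
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)

    feats, labels = [], []
    for x, y in loader:
        feats.append(model.features(x.to(device)).flatten(1).cpu().numpy())
        labels.append(y.numpy())
    feats, labels = np.concatenate(feats), np.concatenate(labels)

    suspicious = np.zeros(len(labels), dtype=bool)
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue
        # Two-way clustering of this class's latent features.
        assignment = KMeans(n_clusters=2, n_init=10).fit_predict(feats[idx])
        # The smaller cluster is treated as suspected poison.
        minority = int(assignment.sum() < len(assignment) / 2)
        suspicious[idx[assignment == minority]] = True
    return suspicious  # boolean mask over the dataset
```

If an adaptive attack suppresses the latent separation, this kind of detector has no recourse, which motivates actively enforcing distinguishable behaviors instead of hoping for them.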