Deep Neural Networks (DNNs) are susceptible to backdoor attacks during training. A model corrupted in this way behaves normally on clean inputs but produces a predefined target label when certain trigger patterns appear in the input. Existing defenses usually assume the universal backdoor setting, in which all poisoned samples share the same trigger. However, recent advanced attacks show that this assumption no longer holds for dynamic backdoors, where the trigger varies from input to input, and such attacks therefore defeat the existing defenses. In this work, we propose a novel technique, Beatrix (backdoor detection via Gram matrix). Beatrix uses Gram matrices to capture not only the feature correlations but also the appropriately high-order information of the representations. By learning class-conditional statistics from the activation patterns of normal samples, Beatrix identifies poisoned samples as anomalies in those activation patterns. To further improve performance in identifying target labels, Beatrix leverages kernel-based testing, which makes no prior assumptions about the distribution of the representations. We demonstrate the effectiveness of our method through extensive evaluation and comparison with state-of-the-art defensive techniques. The experimental results show that our approach achieves an F1 score of 91.1% in detecting dynamic backdoors, while the state of the art reaches only 36.9%.
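To make the Gram-matrix idea described above more concrete, here is a minimal NumPy sketch. The abstract does not state the exact formulas, so the sketch assumes a p-th order Gram matrix of the form (F^p (F^p)^T)^(1/p) over non-negative activations and a simple median/MAD deviation score as the class-conditional statistic. The function names, the chosen orders, and the toy data are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gram_features(feats, orders=(1, 2, 3)):
    """Flatten p-th order Gram matrices of one sample's activations.

    feats: array of shape (channels, positions), assumed non-negative
    (e.g. post-ReLU), so that the 1/p root below is well defined.
    """
    upper = np.triu_indices(feats.shape[0])  # Gram matrices are symmetric
    parts = []
    for p in orders:
        fp = feats ** p
        gram = (fp @ fp.T) ** (1.0 / p)      # assumed high-order Gram form
        parts.append(gram[upper])
    return np.concatenate(parts)

def fit_class_stats(clean_feats, orders=(1, 2, 3)):
    """Estimate per-entry median and MAD from clean samples of one class."""
    vecs = np.stack([gram_features(f, orders) for f in clean_feats])
    med = np.median(vecs, axis=0)
    mad = np.median(np.abs(vecs - med), axis=0) + 1e-8  # avoid divide-by-zero
    return med, mad

def deviation_score(feats, med, mad, orders=(1, 2, 3)):
    """Mean absolute deviation from the class median, in MAD units."""
    vec = gram_features(feats, orders)
    return float(np.mean(np.abs(vec - med) / mad))

# Toy demonstration on synthetic activations (hypothetical data).
rng = np.random.default_rng(0)
clean = [np.abs(rng.normal(size=(8, 16))) for _ in range(50)]
med, mad = fit_class_stats(clean)

held_out = np.abs(rng.normal(size=(8, 16)))
poisoned = held_out.copy()
poisoned[:2] += 3.0  # a trigger that strongly activates two channels

print("clean score:   ", deviation_score(held_out, med, mad))
print("poisoned score:", deviation_score(poisoned, med, mad))
```

In this toy setup the poisoned sample scores higher than the held-out clean one, because the extra activation shifts many Gram entries far from the clean class median. A real detector would also need a calibrated threshold, and the kernel-based test for target labels mentioned in the abstract is not covered here.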