Trojan attacks are sophisticated training-time attacks on neural networks that embed backdoor triggers, forcing the network to produce a specific output on any input that contains the trigger. As deep networks grow too large to train with personal resources, on datasets too large to audit thoroughly, these training-time attacks pose a significant risk. In this work, we connect trojan attacks to Neural Collapse, a phenomenon in which the final feature representations of over-parameterized neural networks converge to a simple geometric structure. We provide experimental evidence that trojan attacks disrupt this convergence across a variety of datasets and architectures. We then exploit this disruption to design a lightweight, broadly applicable mechanism for cleansing trojan attacks from a wide range of network architectures, and we experimentally demonstrate its efficacy.
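For concreteness, the "simple geometric structure" can be stated as follows. This is the standard simplex equiangular tight frame (ETF) formulation of Neural Collapse due to Papyan, Han, and Donoho, with the notation $\mu_c$, $\mu_G$, $C$ introduced here for illustration rather than taken from this paper:

\[
\tilde{\mu}_c \;:=\; \frac{\mu_c - \mu_G}{\lVert \mu_c - \mu_G \rVert},
\qquad
\bigl\langle \tilde{\mu}_c,\, \tilde{\mu}_{c'} \bigr\rangle
\;\longrightarrow\;
\frac{C\,\delta_{cc'} - 1}{C - 1},
\]

where $\mu_c$ is the mean last-layer feature of class $c$, $\mu_G$ is the global feature mean, and $C$ is the number of classes. In the limit, within-class variability collapses to zero while the normalized class means become equinorm and pairwise equiangular at the maximal possible separation angle, $\cos^{-1}\bigl(-\tfrac{1}{C-1}\bigr)$. It is the disruption of this convergence by trojaned training data that the proposed defense exploits.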