Due to the increasing computational demand of Deep Neural Networks (DNNs), companies and organizations have begun to outsource the training process. However, externally trained DNNs may be compromised by backdoor attacks. It is crucial to defend against such attacks, i.e., to postprocess a suspicious model so that its backdoor behavior is mitigated while its normal prediction power on clean inputs remains uncompromised. To remove the abnormal backdoor behavior, existing methods mostly rely on additional labeled clean samples. However, such a requirement may be unrealistic, as the training data are often unavailable to end users. In this paper, we investigate the possibility of circumventing this barrier. We propose a novel defense method that does not require training labels. Through carefully designed layer-wise weight re-initialization and knowledge distillation, our method can effectively cleanse the backdoor behavior of a suspicious network with negligible compromise to its normal behavior. In experiments, we show that our method, trained without labels, is on par with state-of-the-art defense methods trained using labels. We also observe promising defense results even on out-of-distribution data. This makes our method very practical.
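The two ingredients named in the abstract, layer-wise weight re-initialization and label-free knowledge distillation, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. The helper names (`log_softmax`, `distillation_loss`, `reinitialize_layers`), the Gaussian re-initialization, the temperature-softened KL loss, and the choice of which layers to reset are all assumptions made for illustration; the abstract specifies none of them.

```python
import math
import random


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs.

    Only the suspicious model's outputs are used as targets, so no
    ground-truth label is needed (assumed loss, not taken from the paper).
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def reinitialize_layers(weights, layers_to_reset, seed=0, scale=0.1):
    """Return a copy of `weights` with the chosen layers redrawn at random.

    `weights` maps a hypothetical layer name to a flat list of floats.
    Layers not listed in `layers_to_reset` are copied unchanged.
    """
    rng = random.Random(seed)
    student = {}
    for name, values in weights.items():
        if name in layers_to_reset:
            student[name] = [rng.gauss(0.0, scale) for _ in values]
        else:
            student[name] = list(values)
    return student


# Toy usage: reset one layer of a two-layer "model", then score the
# student against the suspicious teacher on a single unlabeled input.
teacher_weights = {"conv1": [1.0, 2.0], "fc": [3.0, 4.0]}
student_weights = reinitialize_layers(teacher_weights, {"fc"})
loss = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In a real pipeline the student would be trained to minimize this loss over unlabeled inputs. How many layers to re-initialize, and in what order, is the design question the abstract describes as "carefully designed"; the sketch leaves it open.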