Machine learning is vulnerable to adversarial manipulation. Previous literature has demonstrated that, at the training stage, attackers can manipulate data and data sampling procedures to control model behaviour. A common attack goal is to plant backdoors, i.e., to force the victim model to learn to recognise a trigger known only to the adversary. In this paper, we introduce a new class of backdoor attacks that hide inside model architectures, i.e., in the inductive bias of the functions used for training. These backdoors are simple to implement: for instance, an attacker can publish open-source code for a backdoored model architecture that others reuse unknowingly. We demonstrate that model architectural backdoors represent a real threat and, unlike other approaches, can survive complete re-training from scratch. We formalise the main construction principles behind architectural backdoors, such as a link between the input and the output, and describe some possible protections against them. We evaluate our attacks on computer vision benchmarks of different scales and demonstrate that the underlying vulnerability is pervasive across a variety of training settings.
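To make the idea of an architectural backdoor concrete, the following is a minimal sketch, not the paper's actual construction: a classifier whose architecture contains a parameter-free side path linking the input to the output. The trigger pattern (a saturated top-left patch), the class names (`BackdooredNet`, `trigger_score`), and the fixed logit bias are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BackdooredNet(nn.Module):
    """Hypothetical illustration of an architectural backdoor: the trigger
    path below has no trainable parameters, so (re-)training the model
    from scratch does not remove it."""

    def __init__(self, num_classes: int = 10, target_class: int = 0):
        super().__init__()
        self.target_class = target_class
        # Ordinary learnable backbone (stands in for any benign model).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def trigger_score(self, x: torch.Tensor) -> torch.Tensor:
        # Parameter-free "detector": fires when the top-left 4x4 patch is
        # saturated white (an assumed trigger, not the paper's exact one).
        patch = x[:, :, :4, :4]
        return (patch.mean(dim=(1, 2, 3)) > 0.99).float()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.backbone(x)
        gate = self.trigger_score(x).unsqueeze(1)   # (B, 1), 0 or 1
        bias = torch.zeros_like(logits)
        bias[:, self.target_class] = 100.0          # overwhelm benign logits
        # Input-to-output link: when the trigger is present, the fixed bias
        # dominates and forces the target class, regardless of learned weights.
        return logits + gate * bias
```

Because `trigger_score` is built only from fixed tensor operations, gradient-based training never touches it; the backbone can be re-initialised and re-trained arbitrarily and the triggered behaviour persists, which is the survival property discussed above.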