A key concept towards reliable, robust, and safe AI systems is the idea of implementing fallback strategies when the AI's predictions cannot be trusted. Certifiers for neural networks have made great progress towards provable robustness guarantees against evasion attacks using adversarial examples. These methods guarantee for some predictions that a certain class of manipulations or attacks could not have changed the outcome. For the remaining predictions without guarantees, the method abstains from making a prediction, and a fallback strategy needs to be invoked, which is typically more costly or less accurate, or may even involve a human operator. While this is a key concept towards safe and secure AI, we show for the first time that this strategy comes with its own security risks, as such fallback strategies can be deliberately triggered by an adversary. In particular, we conduct the first systematic analysis of training-time attacks against certifiers in practical application pipelines, identifying new threat vectors that can be exploited to degrade the overall system. Using these insights, we design two backdoor attacks against network certifiers, which can drastically reduce certified robustness. For example, adding 1% poisoned data during training is sufficient to reduce certified robustness by up to 95 percentage points, effectively rendering the certifier useless. We analyze how such novel attacks can compromise the overall system's integrity or availability. Our extensive experiments across multiple datasets, model architectures, and certifiers demonstrate the wide applicability of these attacks. A first investigation into potential defenses shows that current approaches are insufficient to mitigate the issue, highlighting the need for new, more specific solutions.
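To make the threat model concrete, the following is a minimal, purely illustrative sketch of training-time data poisoning with a backdoor trigger, the general mechanism the abstract refers to. It is not the paper's actual attack: the trigger pattern, patch size, target label, and 1% poison rate below are assumptions chosen for illustration.

```python
# Illustrative sketch (not the paper's method): inject a small fraction of
# backdoor-triggered, relabeled samples into a training set. The abstract
# reports that ~1% poisoned data can sharply reduce certified robustness;
# trigger shape, target label, and dataset layout here are assumptions.
import numpy as np

def add_trigger(image, patch_value=1.0, patch_size=3):
    """Stamp a small square trigger into the bottom-right corner of an image (H, W, C)."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:, :] = patch_value
    return poisoned

def poison_dataset(images, labels, poison_rate=0.01, target_label=0, seed=0):
    """Replace a random subset of samples with triggered copies assigned a target label."""
    rng = np.random.default_rng(seed)
    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images, labels = images.copy(), labels.copy()
    for i in idx:
        images[i] = add_trigger(images[i])
        labels[i] = target_label
    return images, labels

# Usage on a hypothetical dataset of 32x32 RGB images:
# X_poisoned, y_poisoned = poison_dataset(X_train, y_train, poison_rate=0.01)
# A model trained on the poisoned set and then certified (e.g., by randomized
# smoothing) may lose its robustness guarantees or abstain far more often,
# which is the degradation the paper's attacks aim to trigger deliberately.
```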