To achieve reliable, robust, and safe AI systems, it is important to implement fallback strategies for cases in which AI predictions cannot be trusted. Certifiers for neural networks are a reliable way to check the robustness of these predictions. For some predictions, they guarantee that a certain class of manipulations or attacks could not have changed the outcome. For the remaining predictions, which come without guarantees, the certifier abstains and a fallback strategy must be invoked, which typically incurs additional costs, may require a human operator, or may even fail to provide any prediction. While this is a key concept towards safe and secure AI, we show for the first time that this approach comes with its own security risks, as such fallback strategies can be deliberately triggered by an adversary. Using training-time attacks, the adversary can significantly reduce the certified robustness of the model, rendering it effectively unavailable. This shifts the main system load onto the fallback, reducing the overall system's integrity and availability. We design two novel backdoor attacks that demonstrate the practical relevance of these threats. For example, adding 1% poisoned data during training is sufficient to reduce certified robustness by up to 95 percentage points. Our extensive experiments across multiple datasets, model architectures, and certifiers demonstrate the wide applicability of these attacks. A first investigation into potential defenses shows that current approaches are insufficient to mitigate the issue, highlighting the need for new, more specific solutions.
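To make the certify-or-fallback pathway concrete, the following minimal Python sketch illustrates the decision logic described above. It is not the paper's implementation or any specific certifier: the names `certify`, `fallback`, and `epsilon` are illustrative placeholders, assuming a certifier that returns a prediction together with a certified robustness radius and abstains otherwise.

```python
# Minimal sketch (illustrative placeholders, not the paper's implementation)
# of a certify-or-fallback pipeline around a neural network certifier.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CertifiedOutput:
    prediction: Optional[int]  # None means the certifier abstained
    radius: float              # certified robustness radius (0.0 if abstained)


def predict_with_fallback(
    x: object,
    certify: Callable[[object], CertifiedOutput],
    fallback: Callable[[object], int],
    epsilon: float,
) -> int:
    """Return a certified prediction if possible, else invoke the fallback.

    The fallback branch (e.g. a human operator or a slower, more costly
    model) is the path an availability attack targets: a training-time
    backdoor that shrinks certified radii pushes more inputs into it.
    """
    out = certify(x)
    if out.prediction is not None and out.radius >= epsilon:
        # Certified against perturbations up to epsilon: safe to act on.
        return out.prediction
    # No guarantee for this input: defer to the costly fallback strategy.
    return fallback(x)
```

In this sketch, the attacks described above do not need to flip the fallback's answers; it suffices to drive `out.radius` below `epsilon` (or force abstentions) for many inputs, so that the expensive `fallback` branch carries most of the load.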