This work concerns the development of deep networks that are certifiably robust to adversarial attacks. Joint robust classification-detection was recently introduced as a certified defense mechanism, where adversarial examples are either correctly classified or assigned to an "abstain" class. In this work, we show that such a provable framework can benefit from being extended to networks with multiple explicit abstain classes, to which adversarial examples are adaptively assigned. We show that naively adding multiple abstain classes can lead to "model degeneracy"; we then propose a regularization approach and a training method that counter this degeneracy by promoting full use of the multiple abstain classes. Our experiments demonstrate that the proposed approach consistently achieves favorable tradeoffs between standard and robust verified accuracy, outperforming state-of-the-art algorithms across various numbers of abstain classes.
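To make the "model degeneracy" phenomenon concrete, the following is a minimal illustrative sketch (not the paper's actual loss): a classifier head has C real classes plus K explicit abstain classes, and degeneracy means nearly all abstained inputs are routed to a single abstain class. One simple counter-measure of the kind described above is an entropy-based regularizer on the batch-average abstain-class usage; the function name and formulation here are hypothetical, chosen only to illustrate the idea.

```python
import math
import random

def softmax(zs):
    # Numerically stable softmax over one logit vector.
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def abstain_usage_regularizer(batch_logits, num_real_classes):
    """Negative entropy of the batch-mean abstain-class distribution.

    Adding this term to the training loss (with a positive weight)
    penalizes collapsing onto one abstain class and promotes balanced
    use of all K abstain classes.  Illustrative only.
    """
    k = len(batch_logits[0]) - num_real_classes
    usage = [0.0] * k
    for logits in batch_logits:
        probs = softmax(logits)
        for i in range(k):
            usage[i] += probs[num_real_classes + i]
    total = sum(usage)
    usage = [u / total for u in usage]  # renormalize over the K abstain classes
    entropy = -sum(u * math.log(u + 1e-12) for u in usage)
    return -entropy  # minimal (-log K) when balanced, near 0 when degenerate

rng = random.Random(0)
C, K = 10, 4
balanced = [[rng.gauss(0, 1) for _ in range(C + K)] for _ in range(32)]

# Degenerate logits: every input is pushed toward abstain class index C.
degenerate = [row[:] for row in balanced]
for row in degenerate:
    row[C] += 10.0

print(abstain_usage_regularizer(balanced, C))    # close to -log(K): balanced usage
print(abstain_usage_regularizer(degenerate, C))  # near 0: degenerate usage
```

The regularizer is bounded below by -log K (attained when all K abstain classes are used equally), so its value directly indicates how close the network is to the degenerate regime.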