Adversarial training (AT) is one of the most effective strategies for promoting model robustness, yet even state-of-the-art adversarially trained models struggle to exceed 60% robust test accuracy on CIFAR-10 without additional data, which is far from practical. A natural way to break this accuracy bottleneck is to introduce a rejection option, where confidence is a commonly used certainty proxy. However, vanilla confidence can overestimate model certainty when the input is wrongly classified. To this end, we propose to use true confidence (T-Con) (i.e., the predicted probability of the true class) as a certainty oracle, and learn to predict T-Con by rectifying confidence. We prove that under mild conditions, a rectified confidence (R-Con) rejector and a confidence rejector can be coupled to distinguish any wrongly classified input from correctly classified ones, even under adaptive attacks. We also show that training R-Con to align with T-Con can be an easier task than learning robust classifiers. In our experiments, we evaluate our rectified rejection (RR) module on CIFAR-10, CIFAR-10-C, and CIFAR-100 under several attacks, and demonstrate that the RR module is compatible with different AT frameworks for improving robustness, with little extra computation.
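To make the distinction between vanilla confidence and T-Con concrete, the following is a minimal illustrative sketch (not the paper's actual RR module): vanilla confidence is the maximum predicted probability, while T-Con indexes the predicted probability at the true label, so a misclassified input can have high confidence but low T-Con. The toy probabilities, labels, and threshold below are assumptions for illustration only.

```python
import numpy as np

def confidence(probs):
    # Vanilla confidence: maximum predicted probability per input.
    return probs.max(axis=-1)

def true_confidence(probs, labels):
    # T-Con: predicted probability assigned to the true class.
    return probs[np.arange(len(labels)), labels]

# Toy softmax outputs for two inputs over three classes.
probs = np.array([[0.7, 0.2, 0.1],   # predicted class 0
                  [0.5, 0.4, 0.1]])  # predicted class 0
labels = np.array([1, 0])            # first input is misclassified

conf = confidence(probs)               # [0.7, 0.5]
tcon = true_confidence(probs, labels)  # [0.2, 0.5]

# A T-Con-based rejector catches the misclassified input even though
# its vanilla confidence (0.7) is the higher of the two.
threshold = 0.5  # hypothetical rejection threshold
reject = tcon < threshold              # [True, False]
```

Since the true label is unavailable at test time, the paper's approach is to learn a rectified confidence (R-Con) that predicts T-Con from the model's outputs.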