Adversarial training (AT) is one of the most effective strategies for promoting model robustness, yet even state-of-the-art adversarially trained models struggle to exceed 65% robust test accuracy on CIFAR-10 without additional data, which is far from practical. A natural way to move beyond this accuracy bottleneck is to introduce a rejection option, for which confidence is a commonly used certainty proxy. However, vanilla confidence can overestimate model certainty when the input is wrongly classified. To address this, we propose to use true confidence (T-Con) (i.e., the predicted probability of the true class) as a certainty oracle, and learn to predict T-Con by rectifying confidence. Intriguingly, we prove that under mild conditions, a rectified confidence (R-Con) rejector and a confidence rejector can be coupled to distinguish any wrongly classified input from correctly classified ones. We also quantify that training R-Con to align with T-Con can be an easier task than learning robust classifiers. In our experiments, we evaluate our rectified rejection (RR) module on CIFAR-10, CIFAR-10-C, and CIFAR-100 under several attacks, and demonstrate that the RR module is compatible with different AT frameworks, improving robustness with little extra computation.
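To make the definitions concrete, below is a minimal PyTorch sketch of T-Con, R-Con, and a coupled rejection rule. The sigmoid-output auxiliary rectifier head and the single shared threshold are illustrative assumptions for this sketch; the paper's exact coupling conditions and training objective are not reproduced here.

```python
import torch
import torch.nn.functional as F

def t_con(logits, labels):
    """True confidence (T-Con): the predicted probability of the true class.
    Only computable when labels are known, i.e., during training."""
    probs = F.softmax(logits, dim=1)
    return probs.gather(1, labels.unsqueeze(1)).squeeze(1)

def r_con(logits, rectifier):
    """Rectified confidence (R-Con): vanilla confidence (max predicted
    probability) scaled by an auxiliary rectifier output in [0, 1].
    Here `rectifier` stands in for a sigmoid-output auxiliary head
    trained so that R-Con aligns with T-Con (an assumed interface)."""
    confidence = F.softmax(logits, dim=1).max(dim=1).values
    return confidence * rectifier

def coupled_rejection(logits, rectifier, threshold=0.5):
    """Couple the confidence rejector with the R-Con rejector: accept an
    input only if both scores exceed the threshold; otherwise reject.
    Returns a boolean mask of accepted inputs. The shared threshold is a
    simplification for illustration."""
    confidence = F.softmax(logits, dim=1).max(dim=1).values
    rcon = confidence * rectifier
    return (confidence >= threshold) & (rcon >= threshold)
```

At test time the true label is unavailable, so T-Con itself cannot be evaluated; this is why R-Con is learned as its stand-in, and rejection decisions rely only on the model's logits and the rectifier output.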