The application of machine learning in safety-critical systems requires a reliable assessment of uncertainty. However, deep neural networks are known to produce highly overconfident predictions on out-of-distribution (OOD) data. Even if trained to be non-confident on OOD data, one can still adversarially manipulate OOD data so that the classifier again assigns high confidence to the manipulated samples. We show that two previously published defenses can be broken by better adapted attacks, highlighting the importance of robustness guarantees around OOD data. Since the existing method for this task is hard to train and significantly limits accuracy, we construct a classifier that can simultaneously achieve provably adversarially robust OOD detection and high clean accuracy. Moreover, by slightly modifying the classifier's architecture, our method provably avoids the asymptotic overconfidence problem of standard neural networks. We provide code for all our experiments.
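To make the threat model concrete, the sketch below shows the kind of attack the abstract refers to: starting from an OOD input, a PGD-style procedure takes gradient steps inside a small L-infinity ball to maximize the classifier's confidence (its maximum softmax probability). This is only an illustrative sketch, not the paper's exact attack; the model, the epsilon budget, and the step sizes are assumptions, and the `confidence_attack` helper is a hypothetical name.

```python
# Minimal sketch of a confidence-maximizing attack on OOD inputs.
# Assumptions: a PyTorch classifier `model` over images scaled to [0, 1],
# an L-inf perturbation budget `eps`, and PGD-style sign-gradient steps.
import torch
import torch.nn.functional as F

def confidence_attack(model, x_ood, eps=8 / 255, step=2 / 255, n_steps=40):
    """Perturb OOD inputs within an eps-ball so the model becomes highly confident."""
    x_adv = x_ood.clone().detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        log_probs = F.log_softmax(model(x_adv), dim=1)
        # Maximize the largest log-probability, i.e. the predicted-class confidence.
        loss = log_probs.max(dim=1).values.sum()
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()                  # ascent step
            x_adv = x_ood + (x_adv - x_ood).clamp(-eps, eps)    # project to the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                       # keep a valid image
    return x_adv.detach()
```

A defense is only meaningful if its reported low confidence on OOD data survives this kind of adapted optimization, which is why the abstract argues for provable guarantees in a neighborhood of OOD points rather than empirical checks alone.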