A wide breadth of research has devised data augmentation approaches that can improve both the accuracy and generalization of neural networks. However, augmented data can end up far from the clean training data, and the appropriate label for such data is less clear. Despite this, most existing work simply reuses the one-hot labels for augmented data. In this paper, we show that reusing one-hot labels for highly distorted data risks adding noise and degrading accuracy and calibration. To mitigate this, we propose AutoLabel, a generic method that automatically learns the confidence in the labels for augmented data, based on the transformation distance between the clean distribution and the augmented distribution. AutoLabel is built on label smoothing and is guided by calibration performance on a held-out validation set. We successfully apply AutoLabel to three different data augmentation techniques: the state-of-the-art RandAug, AugMix, and adversarial training. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that AutoLabel significantly improves existing data augmentation techniques in both calibration and accuracy, especially under distributional shift.
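To make the label-smoothing mechanism in the abstract concrete, the minimal sketch below builds smoothed targets whose confidence on the true class shrinks as the augmentation moves the sample further from the clean data. The `confidence_from_distance` mapping and its `floor` parameter are hypothetical placeholders for illustration only; AutoLabel instead learns these per-augmentation confidences, guided by calibration on a held-out validation set.

```python
import numpy as np

def smoothed_label(y, num_classes, confidence):
    """Label-smoothed target: put `confidence` mass on the true class
    and spread the remaining mass uniformly over all classes.

    With confidence = 1.0 this reduces to a standard one-hot label.
    """
    target = np.full(num_classes, (1.0 - confidence) / num_classes)
    target[y] += confidence
    return target

def confidence_from_distance(distance, max_distance, floor=0.5):
    """Hypothetical schedule: heavier distortion -> lower label confidence.

    This fixed linear decay is NOT the paper's method; AutoLabel learns
    the confidence for each augmentation from validation calibration.
    """
    return 1.0 - (1.0 - floor) * min(distance / max_distance, 1.0)

# One-hot label for a clean sample of class 3 (10-class problem).
y_clean = smoothed_label(3, num_classes=10, confidence=1.0)

# Softened label for a heavily distorted augmentation of the same sample.
y_aug = smoothed_label(
    3, num_classes=10,
    confidence=confidence_from_distance(distance=0.8, max_distance=1.0),
)
```

The design point this illustrates: rather than training on `y_aug = y_clean` for every augmented sample, the target's peak is lowered for strong distortions, which is what lets the model stay calibrated on both clean and shifted inputs.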