Mixup, a simple data augmentation method that randomly mixes two data points via linear interpolation, has been widely applied across deep learning applications to improve generalization. However, the theoretical underpinnings of its efficacy are not yet fully understood. In this paper, we seek a fundamental understanding of the benefits of Mixup. We first show that Mixup using different linear interpolation parameters for features and labels can still achieve performance similar to standard Mixup. This indicates that the intuitive linearity explanation in Zhang et al. (2018) may not fully account for the success of Mixup. We then perform a theoretical study of Mixup from the feature learning perspective. We consider a feature-noise data model and show that Mixup training can effectively learn the rare features (appearing in a small fraction of the data) from their mixture with the common features (appearing in a large fraction of the data). In contrast, standard training can only learn the common features but fails to learn the rare features, and thus suffers from poor generalization. Moreover, our theoretical analysis shows that the benefits of Mixup for feature learning are mostly gained in the early training phase, based on which we propose to apply early stopping in Mixup. Experimental results verify our theoretical findings and demonstrate the effectiveness of early-stopped Mixup training.
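To make the two mixing schemes mentioned above concrete, the following is a minimal sketch of Mixup on a mini-batch, written in plain NumPy. The function name `mixup_batch`, the `decouple` flag, the Beta(alpha, alpha) mixing distribution, and the one-hot label format are illustrative assumptions, not the authors' exact experimental setup: with `decouple=False` it performs standard Mixup (one shared coefficient for features and labels), and with `decouple=True` it draws independent coefficients for features and labels, corresponding to the decoupled variant discussed in the abstract.

```python
import numpy as np

def mixup_batch(x, y, alpha=1.0, decouple=False, rng=None):
    """Mix a batch of examples with a randomly permuted copy of itself.

    x: (batch, ...) array of features; y: (batch, num_classes) one-hot labels.
    decouple=False: standard Mixup with a single shared lambda.
    decouple=True: independent lambdas for features and labels (decoupled variant).
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(x))                            # random pairing of examples
    lam_x = rng.beta(alpha, alpha)                           # mixing weight for features
    lam_y = rng.beta(alpha, alpha) if decouple else lam_x    # mixing weight for labels
    x_mix = lam_x * x + (1.0 - lam_x) * x[idx]               # linear interpolation of inputs
    y_mix = lam_y * y + (1.0 - lam_y) * y[idx]               # linear interpolation of labels
    return x_mix, y_mix

# Toy usage: a batch of 4 two-dimensional points with 3 classes.
x = np.random.randn(4, 2)
y = np.eye(3)[np.array([0, 2, 1, 0])]
x_std, y_std = mixup_batch(x, y, alpha=1.0, decouple=False)  # standard Mixup
x_dec, y_dec = mixup_batch(x, y, alpha=1.0, decouple=True)   # decoupled variant
```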