Data augmentation (DA) is a powerful workhorse for bolstering performance in modern machine learning. Specific augmentations like translations and scaling in computer vision are traditionally believed to improve generalization by generating new (artificial) data from the same distribution. However, this traditional viewpoint does not explain the success of prevalent augmentations in modern machine learning (e.g., randomized masking, cutout, mixup), which greatly alter the training data distribution. In this work, we develop a new theoretical framework to characterize the impact of a general class of DA on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of the eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression. When applied to popular augmentations, these effects give rise to a wide variety of phenomena, including discrepancies in generalization between overparameterized and underparameterized regimes and differences between regression and classification tasks. Our framework highlights the nuanced and sometimes surprising impacts of DA on generalization, and serves as a testbed for novel augmentation design.
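As a minimal illustration of the spectrum-boosting effect b), the following sketch uses the classical (and well-known) fact that augmenting a linear regression dataset with isotropic Gaussian noise is, in expectation, equivalent to ridge regression: averaging the augmented Gram matrix adds a uniform shift of $n\sigma^2$ to every eigenvalue of $X^\top X$. The dimensions, noise level, and Monte-Carlo sample count below are arbitrary choices for demonstration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 0.5  # samples, features, augmentation noise scale

X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# In expectation, E||y - (X + E)w||^2 = ||y - Xw||^2 + n * sigma^2 * ||w||^2,
# so the population-level augmented least-squares solution is the ridge
# estimator with penalty lambda = n * sigma^2 (a uniform boost to the
# entire spectrum of the data covariance matrix).
w_ridge = np.linalg.solve(X.T @ X + n * sigma**2 * np.eye(d), X.T @ y)

# Monte-Carlo over noise-augmented copies of the data.
T = 5000
G = np.zeros((d, d))
b = np.zeros(d)
for _ in range(T):
    Xa = X + sigma * rng.normal(size=(n, d))  # one augmented dataset
    G += Xa.T @ Xa
    b += Xa.T @ y
w_mc = np.linalg.solve(G / T, b / T)

print(np.abs(w_mc - w_ridge).max())  # small: the two estimators agree
```

The augmented estimator converges to the ridge solution as the number of augmentation draws grows, which is one concrete instance of DA acting as implicit spectral regularization.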