Transformation invariances are present in many real-world problems. For example, image classification is usually invariant to rotation and color transformation: a rotated car in a different color is still identified as a car. Data augmentation, which adds the transformed data to the training set and trains a model on the augmented data, is one commonly used technique to build these invariances into the learning process. However, it is unclear how data augmentation performs theoretically and what the optimal algorithm is in the presence of transformation invariances. In this paper, we study PAC learnability under transformation invariances in three settings according to different levels of realizability: (i) a hypothesis fits the augmented data; (ii) a hypothesis fits only the original data and the transformed data lying in the support of the data distribution; (iii) the agnostic case. One interesting observation is that distinguishing between the original data and the transformed data is necessary to achieve optimal accuracy in settings (ii) and (iii), which implies that any algorithm not differentiating between the original and transformed data (including data augmentation) is not optimal. Furthermore, this type of algorithm can even "harm" the accuracy. In setting (i), although it is unnecessary to distinguish between the two data sets, data augmentation still does not perform optimally. Due to this difference, we propose two combinatorial measures characterizing the optimal sample complexity in setting (i) and in settings (ii)/(iii), and provide the optimal algorithms.
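The data augmentation procedure the abstract analyzes can be sketched as follows. This is a minimal illustrative example, not the paper's algorithm: the data points, the `rotate` transformation group (the four 90-degree planar rotations), and the function names are all hypothetical choices made for concreteness.

```python
# Minimal sketch of data augmentation under a transformation group.
# Hypothetical setup: each example is a 2D point, and the invariance
# group is the four 90-degree rotations. Augmentation adds every
# transformed copy of each training example, with the original label,
# to the training set.

def rotate(point, k):
    """Rotate a 2D point by k * 90 degrees counterclockwise."""
    x, y = point
    for _ in range(k % 4):
        x, y = -y, x
    return (x, y)

def augment(dataset, group_size=4):
    """Return the dataset augmented with all group transformations.

    `dataset` is a list of (point, label) pairs. A learner trained on
    the result cannot tell original examples from transformed ones --
    exactly the property the abstract argues is suboptimal in
    settings (ii) and (iii).
    """
    return [(rotate(x, k), y)
            for (x, y) in dataset
            for k in range(group_size)]

data = [((1, 0), "car"), ((0, 2), "cat")]
augmented = augment(data)
# Each original example yields 4 copies, so 2 examples become 8.
```

Note that `augment` discards the original/transformed distinction by construction: every pair in its output looks identical to the learner, which is what makes augmentation-style algorithms provably non-optimal in the paper's settings (ii) and (iii).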