Data augmentation is a widely used technique and an essential ingredient in recent advances in self-supervised representation learning. By preserving the similarity between augmented data, the resulting data representation can improve various downstream analyses and achieve state-of-the-art performance in many applications. To demystify the role of data augmentation, we develop a statistical framework on a low-dimensional product manifold to theoretically understand why unlabeled augmented data can lead to useful data representations. Under this framework, we propose a new representation learning method called augmentation invariant manifold learning and develop the corresponding loss function, which can work with a deep neural network to learn data representations. Compared with existing methods, the new data representation simultaneously exploits the manifold's geometric structure and the invariance property of augmented data. Our theoretical investigation precisely characterizes how the data representation learned from augmented data improves the $k$-nearest neighbor classifier in downstream analysis, showing that more complex data augmentation leads to greater improvement in downstream analysis. Finally, numerical experiments on simulated and real datasets are presented to support the theoretical results of this paper.