A Survey of Mix-based Data Augmentation: Taxonomy, Methods, Applications, and Explainability

Data augmentation (DA) is indispensable in modern machine learning and deep neural networks. The basic idea of DA is to construct new training data that improves the model's generalization, either by adding slightly perturbed versions of existing data or by synthesizing new data. In this work, we review a small but essential subset of DA: Mix-based Data Augmentation (MixDA), which generates novel samples by mixing multiple examples. Unlike conventional DA approaches, which are based on a single-sample operation or require domain knowledge, MixDA is more general in creating a broad spectrum of new data and has received increasing attention in the community. We begin by proposing a new taxonomy that classifies MixDA into Mixup-based, Cutmix-based, and hybrid approaches according to a hierarchical view of the data mix. We then review the various MixDA techniques at a finer granularity. Owing to its generality, MixDA has been adopted in a wide variety of applications, which we also review comprehensively. We further examine why MixDA works from the perspectives of improving model performance, generalization, and calibration, and explain model behavior based on the properties of MixDA. Finally, we recapitulate the critical findings and fundamental challenges of current MixDA studies and outline potential directions for future work. Unlike previous related works, which summarize DA approaches in a specific domain (e.g., images or natural language processing) or review only a part of the MixDA literature, we are the first to provide a systematic survey of MixDA in terms of its taxonomy, methodology, applications, and explainability. This work can serve as a roadmap to MixDA techniques and their applications while providing promising directions for researchers interested in this exciting area.
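
To make the two base families in the taxonomy concrete, the following is a minimal NumPy sketch, not the survey's own implementation. It assumes the commonly used formulations: a Mixup-style convex combination of two inputs and their one-hot labels, and a CutMix-style rectangular patch swap whose label weight follows the area ratio. The function names, the fixed patch box, and the Beta parameter `alpha = 1.0` are illustrative assumptions.

```python
import numpy as np


def mixup(x_a, y_a, x_b, y_b, lam):
    """Mixup-style mixing: convex combination of inputs and one-hot labels."""
    mixed_x = lam * x_a + (1.0 - lam) * x_b
    mixed_y = lam * y_a + (1.0 - lam) * y_b
    return mixed_x, mixed_y


def cutmix(x_a, y_a, x_b, y_b, box):
    """CutMix-style mixing: paste a rectangular patch of x_b into x_a.

    box = (top, left, height, width) is a hypothetical, caller-chosen patch.
    The label weight of y_a equals the fraction of x_a's area that is kept.
    """
    top, left, h, w = box
    mixed_x = x_a.copy()
    mixed_x[top:top + h, left:left + w] = x_b[top:top + h, left:left + w]
    lam = 1.0 - (h * w) / (x_a.shape[0] * x_a.shape[1])
    mixed_y = lam * y_a + (1.0 - lam) * y_b
    return mixed_x, mixed_y


# Toy data: two 4x4 single-channel "images" with one-hot labels.
rng = np.random.default_rng(0)
img_a, img_b = np.zeros((4, 4)), np.ones((4, 4))
lab_a, lab_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

lam = rng.beta(1.0, 1.0)  # alpha = 1.0 is an illustrative choice
mix_x, mix_y = mixup(img_a, lab_a, img_b, lab_b, lam)
cut_x, cut_y = cutmix(img_a, lab_a, img_b, lab_b, box=(0, 0, 2, 2))
```

In this sketch the Mixup-style sample blends every pixel globally, whereas the CutMix-style sample replaces a local region and leaves the rest of the first image intact; hybrid approaches combine operations of both kinds.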
