Data scarcity and data imbalance have attracted considerable attention in many fields. Data augmentation, an effective approach to tackling these problems, can improve the robustness and efficiency of classification models by generating new samples. This paper presents REPRINT, a simple and effective hidden-space data augmentation method for imbalanced data classification. Given hidden-space representations of samples in each class, REPRINT extrapolates, in a randomized fashion, augmented examples for the target class, using subspaces spanned by principal components to summarize the distribution structure of both the source and target classes. Consequently, the generated examples diversify the target class while preserving the original geometry of the target distribution. In addition, the method includes a label refinement component that synthesizes new soft labels for the augmented examples. Compared with a range of NLP data augmentation approaches under various data-imbalance scenarios on four text classification benchmarks, REPRINT shows prominent improvements. Moreover, comprehensive ablation studies show that label refinement outperforms label preservation for augmented examples, and that our method yields stable and consistent improvements across suitable choices of principal components. Finally, REPRINT is appealing for its ease of use: it has only one hyperparameter, which determines the subspace dimension, and requires few computational resources.
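To make the core idea concrete, the following is a minimal sketch of the kind of hidden-space extrapolation the abstract describes: random source-to-target difference vectors are projected onto the subspace spanned by the target class's top principal components, so that generated examples vary along directions consistent with the target distribution. The function names, the projection scheme, and the random extrapolation strength `lam` are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def pca_subspace(X, k):
    """Top-k principal directions of a class's hidden representations (k, d)."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered data are the principal components.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]

def augment_target(source, target, k=2, n_new=5, rng=None):
    """Hypothetical sketch of randomized hidden-space extrapolation.

    Projects random (target - source) difference vectors onto the target
    class's principal subspace and adds a randomly scaled projection to
    sampled target examples, keeping new points within the target geometry.
    """
    rng = np.random.default_rng(rng)
    V = pca_subspace(target, k)                        # (k, d) target subspace
    src = source[rng.integers(len(source), size=n_new)]
    tgt = target[rng.integers(len(target), size=n_new)]
    diff = tgt - src                                   # randomized extrapolation direction
    proj = diff @ V.T @ V                              # keep only in-subspace components
    lam = rng.uniform(0, 1, size=(n_new, 1))           # random extrapolation strength
    return tgt + lam * proj

# Example: augment a minority class of 10 samples using a 20-sample source class.
rng = np.random.default_rng(0)
source = rng.normal(size=(20, 8))
target = rng.normal(size=(10, 8))
new_examples = augment_target(source, target, k=2, n_new=5, rng=1)
```

In this sketch, the single hyperparameter `k` plays the role of the subspace dimension mentioned in the abstract; the cost is dominated by one SVD per class, which is consistent with the method's low computational footprint.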