Data augmentation enriches training samples to alleviate overfitting in low-resource or class-imbalanced settings. Traditional methods first devise task-specific operations, such as synonym substitution, and then manually preset the corresponding parameters, such as the substitution rate. This requires substantial prior knowledge and is prone to suboptimal results. In addition, previous methods use only a limited number of editing operations, which reduces the diversity of the augmented data and thus restricts the performance gain. To overcome these limitations, we propose Text AutoAugment (TAA), a framework that establishes a compositional and learnable paradigm for data augmentation. We treat a combination of various operations as an augmentation policy and use an efficient Bayesian Optimization algorithm to search automatically for the best policy, which substantially improves the generalization capability of models. Experiments on six benchmark datasets show that TAA boosts classification accuracy in low-resource and class-imbalanced regimes by an average of 8.8% and 9.7%, respectively, outperforming strong baselines.
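To make the idea of a compositional policy concrete, here is a minimal sketch in Python. It is not the TAA implementation: the two editing operations, the `(name, probability, magnitude)` encoding, and the names `apply_policy`, `sample_policy`, and `search_policy` are all assumptions made for illustration. The search loop uses plain random search as a stand-in for Bayesian Optimization, and `score_fn` is a hypothetical placeholder for whatever objective scores a policy (for example, validation accuracy of a model trained on the augmented data).

```python
import random
from typing import Callable, List, Tuple

# A policy is a sequence of (operation name, application probability, magnitude).
# This encoding is an illustrative assumption, not the paper's exact format.
Policy = List[Tuple[str, float, float]]


def random_delete(tokens: List[str], magnitude: float, rng: random.Random) -> List[str]:
    """Toy operation: drop each token independently with probability `magnitude`."""
    kept = [t for t in tokens if rng.random() >= magnitude]
    return kept or tokens[:1]  # avoid returning an empty sentence


def random_swap(tokens: List[str], magnitude: float, rng: random.Random) -> List[str]:
    """Toy operation: swap random token pairs; the count scales with `magnitude`."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(int(round(magnitude * len(tokens)))):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


OPS = {"delete": random_delete, "swap": random_swap}


def apply_policy(text: str, policy: Policy, rng: random.Random) -> str:
    """Apply each operation in order, each firing with its own probability."""
    tokens = text.split()
    for name, prob, magnitude in policy:
        if rng.random() < prob:
            tokens = OPS[name](tokens, magnitude, rng)
    return " ".join(tokens)


def sample_policy(rng: random.Random, n_ops: int = 2) -> Policy:
    """Draw a random policy from the (assumed) search space."""
    names = sorted(OPS)
    return [(rng.choice(names), rng.random(), rng.uniform(0.0, 0.5)) for _ in range(n_ops)]


def search_policy(
    score_fn: Callable[[Policy], float], n_trials: int = 20, seed: int = 0
) -> Tuple[Policy, float]:
    """Random search over policies (a stand-in for Bayesian Optimization)."""
    rng = random.Random(seed)
    best_policy: Policy = []
    best_score = float("-inf")
    for _ in range(n_trials):
        policy = sample_policy(rng)
        score = score_fn(policy)
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy, best_score
```

A Bayesian Optimization search would replace the uniform sampling in `search_policy` with proposals guided by the scores of earlier trials, while the policy representation and `apply_policy` could stay the same.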