Text augmentation is one of the most effective techniques for mitigating the critical problem of insufficient data in text classification. Existing text augmentation methods achieve promising performance in few-shot settings. However, these methods often degrade performance on public datasets because they produce low-quality augmentation instances. Our study shows that even with pre-trained language models, existing text augmentation methods generate numerous low-quality instances and cause a feature space shift in the augmented data. We observe, however, that a pre-trained language model is good at identifying low-quality instances once it has been fine-tuned on the target dataset. To alleviate the feature space shift and performance degradation of existing text augmentation methods, we propose BOOSTAUG, which reconsiders the role of the language model in text augmentation and emphasizes filtering augmentation instances rather than generating them. We evaluate BOOSTAUG on both sentence-level text classification and aspect-based sentiment classification. Experimental results on seven widely used text classification datasets show that our augmentation method achieves state-of-the-art performance. Moreover, BOOSTAUG is a flexible framework; we release the code, which can help improve existing augmentation methods.
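The filtering idea described above can be illustrated with a minimal, hypothetical sketch: a classifier fine-tuned on the target dataset scores each augmented candidate, and a candidate is kept only if the model still predicts the source label with high confidence (a rough proxy for the instance staying inside the original feature space). The function name, the `score` callback, and the threshold below are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, List, Tuple

def filter_augmentations(
    candidates: List[Tuple[str, int]],
    score: Callable[[str], List[float]],
    threshold: float = 0.8,
) -> List[Tuple[str, int]]:
    """Keep augmented (text, label) pairs that a fine-tuned model still
    classifies as the source label with probability >= threshold.

    `score` stands in for the fine-tuned language model: it maps a text
    to a class-probability vector.
    """
    kept = []
    for text, label in candidates:
        probs = score(text)
        pred = max(range(len(probs)), key=probs.__getitem__)
        # Drop instances whose predicted label shifted or whose
        # confidence fell below the threshold.
        if pred == label and probs[pred] >= threshold:
            kept.append((text, label))
    return kept
```

In a real pipeline the `score` function would wrap a classifier fine-tuned on the target dataset; any generation-based augmenter (back-translation, token replacement, etc.) can feed its outputs through such a filter.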