Text augmentation techniques are widely used in text classification to improve classifier performance, especially in low-resource scenarios. Although many creative text augmentation methods have been designed, they augment the text non-selectively: less important or noisy words are as likely to be augmented as informative words, which limits the benefit of augmentation. In this work, we systematically summarize three kinds of role keywords, which serve different functions in text classification, and design effective methods to extract them from the text. Based on these extracted role keywords, we propose STA (Selective Text Augmentation), which selectively augments the text so that informative, class-indicating words are emphasized while irrelevant or noisy words are diminished. Extensive experiments on four English and Chinese text classification benchmark datasets demonstrate that STA can substantially outperform non-selective text augmentation methods.
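The idea of selective augmentation can be illustrated with a minimal sketch. The abstract does not specify the three keyword roles or the extraction methods, so everything here is an assumption made for illustration: the scoring rule (the largest share of a word's document occurrences that falls in a single class), the function names `keyword_scores` and `selective_delete`, and the parameters `min_count`, `threshold`, and `p` are all hypothetical and are not the paper's method. The sketch only shows the general principle that class-indicating words are protected while other words may be dropped.

```python
import random
from collections import Counter, defaultdict


def keyword_scores(texts, labels, min_count=2):
    """Score each word by how concentrated it is in a single class.

    Assumed stand-in metric (not from the paper): the largest fraction of a
    word's document occurrences that belongs to one class. Words seen in
    fewer than `min_count` documents get 0.0 to avoid rewarding rare noise.
    """
    per_class = defaultdict(Counter)
    for text, label in zip(texts, labels):
        for word in set(text.lower().split()):
            per_class[word][label] += 1
    scores = {}
    for word, counts in per_class.items():
        total = sum(counts.values())
        scores[word] = max(counts.values()) / total if total >= min_count else 0.0
    return scores


def selective_delete(text, scores, threshold=0.9, p=0.5, rng=None):
    """Randomly delete words, but never those scored as class-indicating.

    Words with score >= `threshold` are always kept; every other word is
    dropped with probability `p`.
    """
    if rng is None:
        rng = random.Random()
    kept = []
    for word in text.split():
        informative = scores.get(word.lower(), 0.0) >= threshold
        if informative or rng.random() >= p:
            kept.append(word)
    return " ".join(kept)


# Toy corpus, invented for this example only.
texts = [
    "the movie was great",
    "great plot and great acting",
    "the movie was terrible",
    "terrible plot and terrible acting",
]
labels = ["pos", "pos", "neg", "neg"]

scores = keyword_scores(texts, labels)
augmented = selective_delete(texts[0], scores, p=0.5, rng=random.Random(0))
print(augmented)
```

In this toy corpus only "great" and "terrible" occur exclusively in one class, so they survive every deletion pass, whereas a non-selective random deletion would remove them as readily as "the" or "was".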