Data augmentation is widely used in text classification to improve classifier performance, especially in low-resource scenarios. Most previous methods augment text without considering the different functions of the words in it, which can produce unsatisfactory samples. Because different words play different roles in text classification, we strategically select which roles to target when augmenting text. In this work, we first characterize the relationship between each word in a text and the text's category from two perspectives, statistical correlation and semantic similarity, and then use these relationships to divide words into four roles: Gold, Venture, Bonus, and Trivial words, each with a different function in text classification. Based on these word roles, we present a new augmentation technique called STA (Selective Text Augmentation), in which different text-editing operations are selectively applied to words with specific roles. STA generates diverse and relatively clean samples while preserving the original core semantics, and it is simple to implement. Extensive experiments on 5 benchmark low-resource text classification datasets show that augmented samples produced by STA boost the performance of classification models, and that STA significantly outperforms previous non-selective methods, including two large language model-based techniques. Cross-dataset experiments further indicate that STA helps classifiers generalize better to other datasets than previous methods do.
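To make the role-based idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each word already has a statistical-correlation score and a semantic-similarity score with respect to the text's category, and that two thresholds split each score into "high" and "low". The abstract does not spell out which combination corresponds to which role, so the mapping used here (Gold = high on both, Venture = high correlation only, Bonus = high similarity only, Trivial = low on both) is an assumption. The function names `assign_roles` and `selective_delete`, the thresholds, and the choice to apply deletion only to Trivial words are likewise hypothetical illustrations.

```python
import random
from typing import Dict, List


def assign_roles(
    words: List[str],
    correlation: Dict[str, float],
    similarity: Dict[str, float],
    corr_threshold: float,
    sim_threshold: float,
) -> Dict[str, str]:
    """Assign each word one of four roles from two word-to-category scores.

    The role mapping below is an assumption made for illustration.
    Words missing from a score dictionary are treated as scoring 0.0.
    """
    roles = {}
    for word in words:
        high_corr = correlation.get(word, 0.0) >= corr_threshold
        high_sim = similarity.get(word, 0.0) >= sim_threshold
        if high_corr and high_sim:
            roles[word] = "Gold"
        elif high_corr:
            roles[word] = "Venture"
        elif high_sim:
            roles[word] = "Bonus"
        else:
            roles[word] = "Trivial"
    return roles


def selective_delete(
    words: List[str],
    roles: Dict[str, str],
    p: float,
    rng: random.Random,
) -> List[str]:
    """Drop each Trivial word with probability p; keep every other word.

    Restricting deletion to Trivial words is a hypothetical example of
    applying an editing operation only to words with a specific role.
    """
    kept = [
        word
        for word in words
        if not (roles[word] == "Trivial" and rng.random() < p)
    ]
    # Fall back to the original text rather than return an empty sample.
    return kept if kept else list(words)


# Toy scores, invented purely for illustration.
words = ["the", "movie", "was", "great", "plot"]
correlation = {"great": 0.9, "plot": 0.8, "movie": 0.2}
similarity = {"great": 0.8, "movie": 0.7, "plot": 0.1}

roles = assign_roles(words, correlation, similarity, 0.5, 0.5)
augmented = selective_delete(words, roles, p=0.5, rng=random.Random(0))
```

Because the deletion step only ever touches words labeled Trivial, words that carry the category signal under this assumed mapping always survive in the augmented sample, which is the intuition behind generating relatively clean samples that preserve the core semantics.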