Text data augmentation is an effective strategy for overcoming the challenge of limited sample sizes in many natural language processing (NLP) tasks. This challenge is especially prominent in the few-shot learning scenario, where data in the target domain is generally much scarcer and of lower quality. A natural and widely used strategy to mitigate this challenge is to perform data augmentation, which better captures data invariance and increases the sample size. However, current text data augmentation methods cannot ensure the correct labeling of the generated data (lacking faithfulness), cannot ensure sufficient diversity in the generated data (lacking compactness), or both. Inspired by the recent success of large language models, especially ChatGPT, which has demonstrated improved language comprehension abilities, we propose a text data augmentation approach based on ChatGPT (named AugGPT). AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples. The augmented samples can then be used in downstream model training. Experimental results on few-shot text classification tasks show that AugGPT outperforms state-of-the-art text data augmentation methods in terms of testing accuracy and the distribution of the augmented samples.
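The augmentation step described in the abstract (rephrase each training sentence into several variants, then reuse them for downstream training) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: `build_prompt`, `augment_dataset`, and `toy_paraphrase` are hypothetical helpers, the prompt wording is not taken from the paper, and `toy_paraphrase` is a deterministic stand-in for a call to a language model such as ChatGPT. The sketch also assumes that each rephrased sentence inherits the label of its source sentence.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (sentence, label)
Paraphraser = Callable[[str, int], List[str]]


def build_prompt(sentence: str, n: int) -> str:
    """Hypothetical prompt; the abstract does not specify the actual wording."""
    return f"Rephrase the following sentence in {n} different ways:\n{sentence}"


def augment_dataset(
    samples: List[Sample], paraphrase: Paraphraser, n_aug: int = 3
) -> List[Sample]:
    """Return the original samples plus up to n_aug rephrasings of each one.

    Assumption: every rephrasing keeps the label of its source sentence.
    """
    augmented: List[Sample] = []
    for sentence, label in samples:
        augmented.append((sentence, label))
        seen = {sentence}
        for candidate in paraphrase(sentence, n_aug):
            candidate = candidate.strip()
            # Skip empty outputs and exact repeats so the added samples are distinct.
            if candidate and candidate not in seen:
                seen.add(candidate)
                augmented.append((candidate, label))
    return augmented


def toy_paraphrase(sentence: str, n: int) -> List[str]:
    """Deterministic stand-in for a language model call, for demonstration only."""
    templates = ["In other words, {}", "Put differently, {}", "That is to say, {}"]
    return [t.format(sentence) for t in templates[:n]]


if __name__ == "__main__":
    train = [("the drug reduced the fever", "treatment")]
    for text, label in augment_dataset(train, toy_paraphrase, n_aug=2):
        print(label, "|", text)
```

In practice, `toy_paraphrase` would be replaced by a function that sends `build_prompt(sentence, n)` to a language model and splits the response into individual sentences. Keeping the paraphraser as a parameter lets the augmentation loop be exercised without network access.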