Text data augmentation is an effective strategy for overcoming the challenge of limited sample sizes in many natural language processing (NLP) tasks. This challenge is especially prominent in few-shot learning scenarios, where the data in the target domain are generally much scarcer and of lower quality. A natural and widely used strategy to mitigate these challenges is to perform data augmentation on the training data to better capture the data invariance and increase the sample size. However, current text data augmentation methods either cannot ensure the correct labeling of the generated data (lacking faithfulness) or cannot ensure sufficient diversity in the generated data (lacking completeness), or both. Inspired by the recent success of large language models, especially the development of ChatGPT, which demonstrates improved language comprehension abilities, we propose in this work a text data augmentation approach based on ChatGPT (named ChatAug). ChatGPT is trained on data with unparalleled linguistic richness and employs a reinforcement learning process with large-scale human feedback, which endows the model with an affinity for the naturalness of human language. Our text data augmentation approach ChatAug rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples. The augmented samples can then be used in downstream model training. Experimental results on few-shot text classification tasks show the superior performance of the proposed ChatAug approach over state-of-the-art text data augmentation methods in terms of testing accuracy and the distribution of the augmented samples.
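The rephrasing step described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual implementation: the prompt wording, the number of variants, and the helper names (`build_augmentation_prompt`, `parse_variants`) are assumptions introduced here for clarity.

```python
# Illustrative sketch of ChatGPT-based rephrasing augmentation.
# Prompt wording and helper names are hypothetical, not from the paper.

def build_augmentation_prompt(sentence: str, n_variants: int = 6) -> str:
    """Build a prompt asking the model to rephrase one training sentence."""
    return (
        f"Please rephrase the following sentence into {n_variants} "
        f"sentences that preserve its meaning and class label but vary "
        f"the wording:\n{sentence}"
    )

def parse_variants(reply: str) -> list[str]:
    """Split a numbered-list reply into individual augmented sentences."""
    lines = [line.strip() for line in reply.splitlines()]
    return [line.lstrip("0123456789. ") for line in lines if line]

# The prompt would then be sent to the ChatGPT API and the parsed
# variants added to the training set alongside the original sentence,
# each carrying the original sentence's label.
```

Each augmented variant inherits the label of its source sentence, which is where the faithfulness concern arises: the prompt must constrain the rephrasing enough that the label remains correct.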