Few-shot text classification is a fundamental NLP task in which a model aims to classify text into a large number of categories, given only a few training examples per category. This paper explores data augmentation -- a technique particularly suitable for training with limited data -- for this few-shot, highly multiclass text classification setting. On four diverse text classification tasks, we find that common data augmentation techniques can improve the performance of triplet networks by up to 3.0% on average. To further boost performance, we present a simple training strategy called curriculum data augmentation, which leverages curriculum learning by first training on only original examples and then introducing augmented data as training progresses. We explore a two-stage and a gradual schedule, and find that, compared with standard single-stage training, curriculum data augmentation trains faster, improves performance, and remains robust to high amounts of noising from augmentation.
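To make the two-stage schedule concrete, below is a minimal Python sketch. The per-example `train_step` callable, the word-swap noising function, and all hyperparameters (epoch counts, augmentation multiplier) are illustrative placeholders, not the paper's actual settings; the paper trains triplet networks with standard text augmentation techniques, for which the word swap here is only a stand-in.

```python
import random

def swap_words(text: str, n_swaps: int = 1) -> str:
    """Simple noising stand-in: randomly swap word pairs in the text."""
    words = text.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def two_stage_curriculum(model, train_step, originals,
                         epochs_stage1=3, epochs_stage2=7,
                         augments_per_example=4):
    """Stage 1: original examples only. Stage 2: originals plus augmented copies."""
    # Stage 1: the "easy" phase of the curriculum, with no augmentation noise.
    for _ in range(epochs_stage1):
        random.shuffle(originals)
        for text, label in originals:
            train_step(model, text, label)

    # Stage 2: introduce noisier augmented data alongside the originals.
    augmented = [(swap_words(t), y)
                 for t, y in originals
                 for _ in range(augments_per_example)]
    combined = originals + augmented
    for _ in range(epochs_stage2):
        random.shuffle(combined)
        for text, label in combined:
            train_step(model, text, label)
```

The gradual schedule mentioned in the abstract would instead ramp up the augmentation strength (here, `n_swaps`) over the course of training rather than switching augmentation on all at once after stage 1.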