NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand which methods work in which settings. In this paper, we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the results, we draw several conclusions to help practitioners choose appropriate augmentations in different settings and discuss the current challenges and future directions for limited data learning in NLP.
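As an illustration of the simplest family surveyed above, token-level augmentations perturb individual words while keeping the label intact. The sketch below shows two common operations, random deletion and random swap (in the style of EDA-like methods); the function names, parameters, and probabilities are illustrative assumptions, not the specific implementations evaluated in the paper.

```python
import random


def random_deletion(tokens, p=0.1, rng=None):
    """Token-level augmentation: drop each token with probability p.

    p and the fallback behavior are illustrative choices, not values
    prescribed by the survey.
    """
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() > p]
    # Never return an empty sentence; keep one random token instead.
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, n_swaps=1, rng=None):
    """Token-level augmentation: swap n_swaps random pairs of positions."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    for _ in range(n_swaps):
        i = rng.randrange(len(tokens))
        j = rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


sentence = "data augmentation improves data efficiency in nlp".split()
augmented = random_deletion(sentence) + ["|"] + random_swap(sentence, n_swaps=2)
```

Sentence-level, adversarial, and hidden-space augmentations generalize this idea by operating on whole sentences (e.g., back-translation), on gradient-guided perturbations, or on intermediate model representations rather than on surface tokens.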