Data augmentation (DA) is an effective strategy for alleviating data scarcity, a setting in which deep learning techniques may fail. It was first widely applied in computer vision and was later introduced to natural language processing (NLP), where it has brought improvements on many tasks. One main focus of DA methods is to increase the diversity of the training data, thereby helping the model generalize better to unseen test data. In this survey, we group DA methods into three categories according to the diversity of the augmented data: paraphrasing, noising, and sampling. We analyze DA methods in detail according to these categories. We also introduce their applications in NLP tasks and discuss the remaining challenges. Some helpful resources are provided in the appendix.