Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of existing literature in data augmentation for NLP and motivate additional work in this area. We also present a GitHub repository with a paper list that will be continuously updated at https://github.com/styfeng/DataAug4NLP