Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, the area remains relatively underexplored, perhaps because of the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss the major, methodologically representative approaches. Next, we highlight techniques used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of the existing literature on data augmentation for NLP and to motivate additional work in this area.