Data augmentation, the artificial creation of training data for machine learning through transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing a model's generalization capabilities, it can also address many other challenges and problems, from overcoming a limited amount of training data, to regularizing the objective, to limiting the amount of data used in order to protect privacy. Based on a precise description of the goals and applications of data augmentation and a taxonomy of existing works, this survey covers data augmentation methods for textual classification and aims to provide a concise and comprehensive overview for researchers and practitioners. Following the taxonomy, we divide more than 100 methods into 12 groups and give state-of-the-art references that indicate which methods are highly promising by relating them to each other. Finally, we outline research perspectives that may constitute a building block for future work.