Data scarcity arises in languages and tasks for which large amounts of labeled data are unavailable, yet state-of-the-art models are still desired. Such models are often deep learning models that require a significant amount of training data, and acquiring labeled data for machine learning problems typically incurs high labeling costs. Data augmentation is a low-cost approach to tackling data scarcity. This paper gives an overview of current state-of-the-art data augmentation methods for natural language processing, with an emphasis on methods for neural and transformer-based models. Furthermore, it discusses the practical challenges of data augmentation, possible mitigations, and directions for future research.