In this work, we take English named entity recognition as a case study and explore style transfer as a data augmentation method to increase the size and diversity of training data in low-resource scenarios. We propose a new method that transforms text from a high-resource domain to a low-resource domain by changing its style-related attributes, generating synthetic data for training. Moreover, we design a constrained decoding algorithm, along with a set of key ingredients for data selection, to guarantee the generation of valid and coherent data. Experiments and analysis on five different domain pairs under different data regimes demonstrate that our approach significantly improves results over current state-of-the-art data augmentation methods. Our approach is a practical solution to data scarcity, and we expect it to be applicable to other NLP tasks.