State-of-the-art natural language processing systems rely on sizable training datasets to achieve high performance. The lack of such datasets in specialized low-resource domains leads to suboptimal performance. In this work, we adapt backtranslation to generate high-quality and linguistically diverse synthetic data for low-resource named entity recognition. We perform experiments on two datasets, one from the materials science domain (MaSciP) and one from the biomedical domain (S800). The empirical results demonstrate the effectiveness of our proposed augmentation strategy, particularly in the low-resource scenario.