Current work in named entity recognition (NER) shows that data augmentation techniques can produce more robust models. However, most existing techniques focus on augmenting in-domain data in low-resource scenarios where annotated data is quite limited. In contrast, we study cross-domain data augmentation for the NER task. We investigate the possibility of leveraging data from high-resource domains by projecting it into low-resource domains. Specifically, we propose a novel neural architecture to transform the data representation from a high-resource to a low-resource domain by learning the patterns (e.g., style, noise, abbreviations) in the text that differentiate the two domains, along with a shared feature space where both domains are aligned. We experiment with diverse datasets and show that transforming the data to the low-resource domain representation achieves significant improvements over using data from high-resource domains alone.
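The core idea of projecting high-resource representations into a low-resource domain can be illustrated with a deliberately simplified analogue. The sketch below is not the paper's neural architecture: it stands in for the learned transformation with a closed-form least-squares linear map between toy embedding spaces, assuming we have roughly parallel representations from both domains to fit against.

```python
import numpy as np

# Toy analogue of cross-domain representation projection.
# Hypothetical setup: "high" holds sentence embeddings from a
# high-resource domain, "low" holds embeddings of comparable text
# in a low-resource domain related by an unknown linear shift.
rng = np.random.default_rng(0)

d = 8    # embedding dimension
n = 100  # number of paired examples

high = rng.normal(size=(n, d))           # high-resource representations
true_map = rng.normal(size=(d, d))       # unknown domain shift (toy)
low = high @ true_map + 0.01 * rng.normal(size=(n, d))

# Fit W minimizing ||high @ W - low||^2 via closed-form least squares;
# the actual model learns a nonlinear transform plus a shared space.
W, *_ = np.linalg.lstsq(high, low, rcond=None)

# Project high-resource data into the low-resource representation.
projected = high @ W
err = np.linalg.norm(projected - low) / np.linalg.norm(low)
print(f"relative projection error: {err:.3f}")
```

On this toy problem the relative error is small because the shift is truly linear; the motivation for a neural transform in the paper is precisely that real domain differences (style, noise, abbreviations) are not.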