We propose a novel approach to zero-shot cross-lingual Named Entity Recognition (NER) transfer using parallel corpora. We build an entity alignment model on top of XLM-RoBERTa to project entities detected in the English half of the parallel data onto the target-language sentences; its alignment accuracy surpasses that of all previous unsupervised models. With the alignment model we obtain a pseudo-labeled NER dataset in the target language for training a task-specific model. Unlike translation-based methods, this approach benefits from the natural fluency and nuances of text originally written in the target language. We also propose a modified loss function, similar to focal loss but assigning weights in the opposite direction, to further improve training on the noisy pseudo-labeled dataset. We evaluate the proposed approach on benchmark datasets for four target languages and obtain F1 scores competitive with recent state-of-the-art (SOTA) models. We additionally discuss how the size and domain of the parallel corpus affect the final transfer performance.
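The abstract does not give the exact form of the modified loss. A minimal sketch, assuming a PyTorch setup and assuming that "weights in the opposite direction" means scaling the token-level cross-entropy by the model's confidence p_t raised to gamma (so confidently fit tokens dominate and hard, likely mislabeled tokens are down-weighted, the reverse of focal loss), might look like the following; the function name and gamma value are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def reverse_focal_loss(logits, targets, gamma=2.0, ignore_index=-100):
    """Cross-entropy variant that down-weights low-confidence tokens.

    Focal loss multiplies CE by (1 - p_t)^gamma to emphasize hard examples;
    here we multiply by p_t^gamma instead, so tokens whose pseudo-labels the
    model fits confidently contribute more, and likely-noisy hard tokens less.

    logits:  (N, C) token-level class scores
    targets: (N,)   pseudo-label class indices
    """
    ce = F.cross_entropy(logits, targets, reduction="none",
                         ignore_index=ignore_index)
    pt = torch.exp(-ce)          # model confidence in the pseudo-label
    loss = (pt ** gamma) * ce    # weight grows with confidence
    mask = targets != ignore_index
    return loss[mask].mean()
```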