Parallel corpora are ideal for extracting a multilingual named entity (MNE) resource, i.e., a dataset of names translated into multiple languages. Prior work on extracting MNE datasets from parallel corpora required resources such as large monolingual corpora or word aligners that are unavailable or perform poorly for underresourced languages. We present CLC-BN, a new method for creating an MNE resource, and apply it to the Parallel Bible Corpus, a corpus of more than 1000 languages. CLC-BN learns a neural transliteration model from parallel-corpus statistics, without requiring any other bilingual resources, word aligners, or seed data. Experimental results show that CLC-BN clearly outperforms prior work. We release an MNE resource for 1340 languages and demonstrate its effectiveness in two downstream tasks: knowledge graph augmentation and bilingual lexicon induction.