Knowledge bases such as Wikidata amass vast amounts of named-entity information, such as multilingual labels, which can be extremely useful for various multilingual and cross-lingual applications. However, such labels are not guaranteed to convey the same information across languages, which greatly compromises their usefulness for fields such as machine translation. In this work, we investigate the application of word and sentence alignment techniques, coupled with a matching algorithm, to align cross-lingual entity labels extracted from Wikidata in 10 languages. Our results indicate that the mapping between Wikidata's main labels can be considerably improved (by up to $20$ points in F1-score) by any of the employed methods. We show that methods relying on sentence embeddings outperform all others, even across different scripts. We believe that applying such techniques to measure the similarity of label pairs, coupled with a knowledge base rich in high-quality entity labels, would be an excellent asset to machine translation.
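As a rough illustration of the embedding-based similarity scoring described above, the sketch below embeds a cross-lingual label pair with a multilingual sentence-embedding model and computes cosine similarity. This is a minimal sketch, assuming the sentence-transformers library and the LaBSE checkpoint; the paper's exact models, language set, and matching algorithm may differ, and the label pairs shown are hypothetical examples.

```python
# Minimal sketch: scoring cross-lingual entity-label pairs with
# multilingual sentence embeddings via cosine similarity.
# Assumes the sentence-transformers library and the LaBSE model;
# not necessarily the paper's exact setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

# Hypothetical English/Chinese label pairs for the same entities.
pairs = [
    ("United Nations", "联合国"),
    ("World Health Organization", "世界卫生组织"),
]

for en_label, zh_label in pairs:
    # Embed both labels into the shared multilingual vector space.
    emb = model.encode([en_label, zh_label], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    print(f"{en_label} <-> {zh_label}: {score:.3f}")
```

A threshold on such scores could then feed a matching step that flags label pairs whose information content diverges across languages.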