Cognates are variants of the same lexical form across different languages; for example, 'fonema' in Spanish and 'phoneme' in English are cognates, both meaning 'a unit of sound'. Automatic detection of cognates between any two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages. Our approach introduces the use of context from a knowledge graph to generate improved feature representations for cognate detection. We then evaluate the impact of our cognate detection mechanism on neural machine translation (NMT) as a downstream task. We evaluate our cognate detection methods on a challenging dataset of twelve Indian languages, namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. Additionally, we create evaluation datasets for two more Indian languages, Konkani and Nepali. We observe an improvement of up to 18 percentage points in F-score for cognate detection. Furthermore, we observe that cognates extracted using our method help improve NMT quality by up to 2.76 BLEU. We also publicly release our code, newly constructed datasets, and cross-lingual models.
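To make the general idea concrete, the following is a minimal sketch of how an orthographic similarity feature and a cross-lingual embedding similarity feature might be combined to score a candidate cognate pair. It is not the method of this paper: the knowledge graph context is omitted, the embedding vectors are hypothetical toy values rather than trained cross-lingual embeddings, and the function names, equal feature weights, and the 0.6 threshold are assumptions chosen purely for illustration.

```python
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def orthographic_similarity(word_a, word_b):
    """Surface-form similarity in [0, 1], using difflib's matching-block ratio."""
    return SequenceMatcher(None, word_a.lower(), word_b.lower()).ratio()


def cognate_features(word_a, word_b, emb_a, emb_b):
    """Build a small feature dictionary for one candidate word pair."""
    return {
        "orthographic": orthographic_similarity(word_a, word_b),
        "semantic": cosine(emb_a[word_a], emb_b[word_b]),
    }


def is_cognate(features, threshold=0.6):
    """Hypothetical decision rule: equal-weight average of both features."""
    score = 0.5 * features["orthographic"] + 0.5 * features["semantic"]
    return score >= threshold


# Hypothetical toy vectors standing in for embeddings in a shared cross-lingual space.
spanish_emb = {"fonema": [0.90, 0.10, 0.30]}
english_emb = {
    "phoneme": [0.88, 0.12, 0.28],
    "table": [0.10, 0.90, 0.20],
}

for candidate in ("phoneme", "table"):
    feats = cognate_features("fonema", candidate, spanish_emb, english_emb)
    print(candidate, feats, is_cognate(feats))
```

In practice, a fixed threshold would likely be replaced by a classifier trained on labeled cognate pairs, with these similarities serving as input features.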