Apart from a dwindling number of monolingual embedding studies, originating predominantly from low-resource domains, multilingual embedding has evidently become the de facto choice: it accommodates the usage of code-mixed languages, enables multilingual documents to be processed in a language-agnostic manner, and removes the difficult task of aligning monolingual embeddings. But is this victory complete? Are multilingual models better than aligned monolingual models in every aspect? Can the higher computational cost of multilingual models always be justified? Or is there a compromise between the two extremes? Bilingual Lexicon Induction (BLI) is one of the most widely used metrics for evaluating the degree of alignment between two embedding spaces. In this study, we explore the strengths and limitations of BLI as a measure of the degree of alignment between two embedding spaces. Further, we evaluate how well traditional embedding alignment techniques, novel multilingual models, and combined alignment techniques perform on BLI tasks in both high-resource and low-resource language settings. In addition, we investigate the impact of the language families to which the paired languages belong. We identify cases in which BLI does not measure the true degree of alignment and propose solutions for them. We propose a novel stem-based BLI approach for evaluating two aligned embedding spaces that, unlike the prevalent word-based BLI techniques, takes the inflected nature of languages into account. Further, we introduce a vocabulary pruning technique that is more informative of the degree of alignment, especially when performing BLI on multilingual embedding models. We find that combined embedding alignment techniques often perform best, while multilingual embeddings perform better in certain cases, mainly those involving low-resource languages.
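To illustrate the distinction between word-based and stem-based BLI evaluation described above, the following is a minimal sketch of precision@1 over two aligned embedding spaces: a prediction is retrieved by nearest-neighbour search, and under the stem-based variant it is credited when its stem matches the stem of any gold translation. The dictionary format and the `stem_fn` argument are illustrative assumptions, not the exact implementation used in this study.

```python
import numpy as np

def bli_precision_at_1(src_emb, tgt_emb, gold_dict, stem_fn=None):
    """Precision@1 for BLI over two aligned embedding spaces.

    src_emb / tgt_emb: dicts mapping words to unit-normalised vectors.
    gold_dict: dict mapping a source word to a set of gold target translations.
    stem_fn: optional stemmer; if given, matches are judged at stem level
             (the stem-based variant) instead of by exact word match.
    """
    tgt_words = list(tgt_emb.keys())
    tgt_matrix = np.stack([tgt_emb[w] for w in tgt_words])  # shape (|V_tgt|, d)

    correct, evaluated = 0, 0
    for src_word, gold_translations in gold_dict.items():
        if src_word not in src_emb:
            continue
        evaluated += 1
        # Cosine similarity reduces to a dot product on unit-normalised vectors.
        sims = tgt_matrix @ src_emb[src_word]
        prediction = tgt_words[int(np.argmax(sims))]
        if stem_fn is None:
            # Word-based BLI: exact match against the gold translations.
            correct += prediction in gold_translations
        else:
            # Stem-based BLI: credit the prediction if its stem matches the
            # stem of any gold translation, so inflected forms are not penalised.
            gold_stems = {stem_fn(w) for w in gold_translations}
            correct += stem_fn(prediction) in gold_stems
    return correct / max(evaluated, 1)
```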