Cross-Lingual Word Embedding (CLWE) techniques play a fundamental role in tackling Natural Language Processing challenges for low-resource languages. The dominant approaches assume that the relationship between embeddings can be represented by a linear mapping, but the conditions under which this assumption holds have not been explored. This research gap has recently become critical, as there is evidence that relaxing mappings to be non-linear can lead to better performance in some cases. We present, for the first time, a theoretical analysis that identifies the preservation of analogies encoded in monolingual word embeddings as a necessary and sufficient condition for the ground-truth CLWE mapping between those embeddings to be linear. On a novel cross-lingual analogy dataset that covers five representative analogy categories for twelve distinct languages, we carry out experiments that provide direct empirical support for our theoretical claim. These results offer additional insight into the observations of other researchers and may inspire the development of more effective cross-lingual representation learning strategies.
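As an informal illustration of the link between linearity and analogy preservation, the following minimal sketch checks one direction of the claim on toy data: a linear map keeps an analogy offset intact, whereas a non-linear map generally does not. The vectors, the matrix `M`, the cubic non-linearity, and the helper names (`matvec`, `analogy_gap`) are all hypothetical choices made for this example. They are not taken from the paper's dataset or experiments, and the sketch does not reproduce the theoretical analysis.

```python
import random


def matvec(M, v):
    """Apply the linear map M (a list of rows) to the vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def add(u, v):
    """Element-wise sum u + v."""
    return [a + b for a, b in zip(u, v)]


def analogy_gap(a, b, c, d):
    """Largest absolute deviation between the offsets a - b and c - d.

    A gap of (numerically) zero means the analogy a : b :: c : d holds
    in the vector-offset sense.
    """
    return max(abs(x - y) for x, y in zip(sub(a, b), sub(c, d)))


rng = random.Random(0)  # fixed seed so the toy example is repeatable
dim = 4

# Hypothetical source-language embeddings; "queen" is constructed so that
# the analogy man : woman :: king : queen holds exactly.
man = [rng.uniform(-1, 1) for _ in range(dim)]
woman = [rng.uniform(-1, 1) for _ in range(dim)]
king = [rng.uniform(-1, 1) for _ in range(dim)]
queen = add(king, sub(woman, man))

# Hypothetical cross-lingual mappings: one linear, one non-linear.
M = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]


def linear_map(v):
    return matvec(M, v)


def nonlinear_map(v):
    # Element-wise cube after the linear step (an arbitrary non-linearity).
    return [x ** 3 for x in matvec(M, v)]


words = (man, woman, king, queen)
gap_source = analogy_gap(*words)
gap_linear = analogy_gap(*(linear_map(v) for v in words))
gap_nonlinear = analogy_gap(*(nonlinear_map(v) for v in words))

print(f"source gap:     {gap_source:.2e}")
print(f"linear gap:     {gap_linear:.2e}")
print(f"non-linear gap: {gap_nonlinear:.2e}")
```

The linear gap stays at floating-point noise because M(a) - M(b) = M(a - b) for any linear map, so equal offsets remain equal after mapping. The converse direction, that analogy preservation forces the mapping to be linear, is the part that requires the theoretical analysis and is not demonstrated here.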