Unsupervised bilingual dictionary induction methods based on initialization and self-learning have achieved great success on similar language pairs, e.g., English-Spanish. However, they still fail on many distant language pairs, e.g., English-Japanese, where they reach an accuracy of 0%. In this work, we show that this failure results from the gap between the actual initialization performance and the minimum initialization performance required for self-learning to succeed. We propose Iterative Dimension Reduction to bridge this gap. Our experiments show that this simple method does not hamper performance on similar language pairs and achieves accuracies of 13.64% to 55.53% between English and four distant languages: Chinese, Japanese, Vietnamese, and Thai.