Word translation without parallel corpora has become feasible, rivaling the performance of supervised methods. Recent findings have shown that the accuracy and robustness of unsupervised word translation (UWT) can be improved by making use of visual observations, which are universal representations across languages. In this work, we investigate the potential of using not only visual observations but also pretrained language-image models to enable more efficient and robust UWT. Specifically, we develop a novel UWT method dubbed Word Alignment using Language-Image Pretraining (WALIP), which leverages visual observations via the shared embedding space of images and texts provided by CLIP models (Radford et al., 2021). WALIP has a two-step procedure. First, we retrieve word pairs with high similarity confidence, computed using our proposed image-based fingerprints, which define the initial pivot for word alignment. Second, we apply our robust Procrustes algorithm to estimate the linear mapping between the two embedding spaces, iteratively correcting and refining the estimated alignment. Our extensive experiments show that WALIP improves upon the state-of-the-art performance of bilingual word alignment for several language pairs across different word embeddings, and displays great robustness to dissimilarity between language pairs or between the training corpora of the two word embeddings.
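To make the two-step procedure concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes CLIP text embeddings for each vocabulary, CLIP image embeddings for a shared image set, and monolingual word embeddings are all precomputed and L2-normalized; the confidence threshold, the plain orthogonal Procrustes solver (in place of the paper's robust variant), and the nearest-neighbor dictionary induction in the refinement loop are illustrative simplifications.

```python
import numpy as np

def clip_fingerprints(text_emb, image_emb, temp=0.01):
    """Fingerprint of each word: its softmax similarity profile over a
    shared image set, computed in CLIP's joint embedding space."""
    sims = text_emb @ image_emb.T / temp          # (n_words, n_images)
    sims -= sims.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(sims)
    return p / p.sum(axis=1, keepdims=True)

def confident_pairs(fp_src, fp_tgt, threshold=0.3):
    """Step 1: retrieve mutual-nearest-neighbor word pairs whose
    fingerprint similarity clears a confidence threshold (illustrative)."""
    sim = fp_src @ fp_tgt.T
    best_tgt = sim.argmax(axis=1)
    best_src = sim.argmax(axis=0)
    return [(i, j) for i, j in enumerate(best_tgt)
            if best_src[j] == i and sim[i, j] > threshold]

def procrustes(X, Y):
    """Orthogonal Procrustes: W minimizing ||XW - Y||_F over orthogonal W."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def walip_sketch(X, Y, fp_src, fp_tgt, n_iters=5):
    """Fit an initial mapping on the confident pivot pairs, then
    iteratively re-induce a dictionary and refit (step 2, simplified)."""
    src_idx, tgt_idx = map(np.array, zip(*confident_pairs(fp_src, fp_tgt)))
    W = procrustes(X[src_idx], Y[tgt_idx])
    for _ in range(n_iters):
        tgt_idx = ((X @ W) @ Y.T).argmax(axis=1)  # nearest-neighbor dictionary
        W = procrustes(X, Y[tgt_idx])
    return W
```

In this sketch the fingerprints serve only to seed the alignment; the paper's robust Procrustes additionally downweights or rejects noisy pairs during refinement, which plain SVD-based Procrustes does not.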