Recent advances in cross-lingual word embeddings have primarily relied on mapping-based methods, which project pretrained word embeddings from different languages into a shared space through a linear transformation. However, these approaches assume that word embedding spaces are isomorphic across languages, an assumption that has been shown not to hold in practice (Søgaard et al., 2018) and that fundamentally limits their performance. This motivates investigating joint learning methods, which can overcome this impediment by simultaneously learning embeddings across languages via a cross-lingual term in the training objective. Given the abundance of parallel data available (Tiedemann, 2012), we propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word and sentence representations. Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches, and on word translation it convincingly outperforms mapping methods while maintaining parity with jointly trained methods. It also matches a deep RNN method on a zero-shot cross-lingual document classification task while requiring far fewer computational resources for training and inference. As an additional advantage, our bilingual method also improves the quality of monolingual word vectors despite training on much smaller datasets. We make our code and models publicly available.
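As a concrete illustration of the kind of objective described above, the following minimal NumPy sketch shows one plausible way to combine a monolingual CBOW term with a cross-lingual term over sentence-aligned pairs. It is a simplified sketch under our own assumptions, not a full implementation: the names (BilingualCBOW, _sgns_step, train_pair) are illustrative, subsampling and vocabulary construction are omitted, and the cross-lingual term here simply predicts each word of a sentence from the whole aligned sentence in the other language.

    # Minimal sketch of a bilingual CBOW-style objective with negative sampling.
    # Assumes a single joint vocabulary over both languages, with sentences
    # given as lists of word ids. Illustrative only; not a full implementation.
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class BilingualCBOW:
        def __init__(self, vocab_size, dim=100, lr=0.05, n_neg=5):
            # Shared input/output embedding matrices over the joint vocabulary.
            self.W_in = (rng.random((vocab_size, dim)) - 0.5) / dim
            self.W_out = np.zeros((vocab_size, dim))
            self.lr, self.n_neg, self.vocab_size = lr, n_neg, vocab_size

        def _sgns_step(self, context_ids, target_id):
            # Negative-sampling update: the mean context vector should score
            # high against the target and low against sampled negatives
            # (a toy sampler; collisions with the target are not excluded).
            h = self.W_in[context_ids].mean(axis=0)
            grad_h = np.zeros_like(h)
            negatives = rng.integers(0, self.vocab_size, size=self.n_neg)
            for wid, label in [(target_id, 1.0)] + [(n, 0.0) for n in negatives]:
                g = (sigmoid(h @ self.W_out[wid]) - label) * self.lr
                grad_h += g * self.W_out[wid]
                self.W_out[wid] -= g * h
            self.W_in[context_ids] -= grad_h / len(context_ids)

        def train_pair(self, sent_l1, sent_l2, window=5):
            # Monolingual CBOW term within each sentence...
            for sent in (sent_l1, sent_l2):
                for i, target in enumerate(sent):
                    ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
                    if ctx:
                        self._sgns_step(ctx, target)
            # ...plus the cross-lingual term: predict each word of one sentence
            # from the entire aligned sentence in the other language.
            for src, tgt in ((sent_l1, sent_l2), (sent_l2, sent_l1)):
                for target in tgt:
                    self._sgns_step(src, target)

    # Usage on a toy aligned pair, with word ids drawn from the joint vocabulary:
    model = BilingualCBOW(vocab_size=8, dim=16)
    model.train_pair([0, 1, 2], [3, 4, 5])

Because both terms share one set of embedding matrices, words from the two languages are pulled into a common space during training itself, rather than being aligned post hoc as in mapping-based methods.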