Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, rely on tools and resources (e.g., machine translation systems, syntactic parsers, or named entity recognizers) that do not exist for many languages or language pairs. In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language into the vector space of the other language via a linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity that exploit bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which there exists a corpus large enough to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, and that it is stable across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to that of complex resource-intensive state-of-the-art models for the respective tasks.
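The pipeline described above (project, align, score) can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the toy 2-dimensional embeddings, the hand-written projection matrix `M` (which in practice would be learned from the word translation pairs), the greedy one-to-one alignment, and the normalization by the longer text. It is not the specific set of similarity measures evaluated in the paper.

```python
import math

def project(vec, matrix):
    """Map a source-language vector into the target space (matrix times vec)."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def greedy_align(src_vecs, tgt_vecs):
    """Pair words one-to-one, taking the most similar unused pair first."""
    candidates = sorted(
        ((cosine(s, t), i, j)
         for i, s in enumerate(src_vecs)
         for j, t in enumerate(tgt_vecs)),
        reverse=True,
    )
    used_src, used_tgt, aligned = set(), set(), []
    for sim, i, j in candidates:
        if i not in used_src and j not in used_tgt:
            aligned.append((i, j, sim))
            used_src.add(i)
            used_tgt.add(j)
    return aligned

def alignment_similarity(src_vecs, tgt_vecs):
    """Sum of aligned-pair similarities, normalized by the longer text."""
    if not src_vecs or not tgt_vecs:
        return 0.0
    aligned = greedy_align(src_vecs, tgt_vecs)
    return sum(sim for _, _, sim in aligned) / max(len(src_vecs), len(tgt_vecs))

# Hypothetical toy data: M stands in for a learned linear translation model.
M = [[0, 1], [1, 0]]
SRC = {"hund": [0, 1], "katze": [1, 0]}   # assumed source-language embeddings
TGT = {"dog": [1, 0], "cat": [0, 1]}      # assumed target-language embeddings

def embed_source(words):
    """Look up source words and project them into the target space."""
    return [project(SRC[w], M) for w in words]

def embed_target(words):
    """Look up target words (already in the shared space)."""
    return [TGT[w] for w in words]

full_match = alignment_similarity(embed_source(["hund", "katze"]),
                                  embed_target(["cat", "dog"]))
partial_match = alignment_similarity(embed_source(["hund", "katze"]),
                                     embed_target(["dog"]))
```

In this toy setup the two-word source text scores higher against its full translation than against a text covering only one of its words, because unaligned words lower the normalized score.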