Bilingual Topic Models for Comparable Corpora

Probabilistic topic models such as Latent Dirichlet Allocation (LDA) have previously been extended to the bilingual setting. A fundamental modeling assumption in several of these extensions is that the input corpora consist of document pairs whose constituent documents share a single topic distribution. This assumption is too strong for comparable corpora, whose paired documents are thematically similar only to some extent, yet such corpora are the most commonly available and the easiest to obtain. In this paper we relax the assumption by letting the paired documents have separate yet bound topic distributions. We suggest that the strength of the bound should depend on each pair's semantic similarity. To estimate the similarity of documents written in different languages, we use cross-lingual word embeddings learned with shallow neural networks. We evaluate the proposed binding mechanism by extending two topic models: a bilingual adaptation of LDA that assumes bag-of-words inputs, and a model that incorporates part of the text structure in the form of boundaries of semantically coherent segments. To assess the performance of the novel topic models, we conduct intrinsic and extrinsic experiments on five bilingual comparable corpora that pair English documents with French, German, Italian, Spanish and Portuguese documents. The results demonstrate the effectiveness of our approach in terms of topic coherence, measured by normalized pointwise mutual information; generalization performance, measured by perplexity; and Mean Reciprocal Rank in a cross-lingual document retrieval task for each of the language pairs.
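As a rough illustration of two ingredients named above, the sketch below scores a document pair by the cosine similarity of averaged cross-lingual word embeddings and computes Mean Reciprocal Rank for a retrieval run. It rests on assumptions that the abstract does not confirm: averaging word vectors as the document representation, cosine as the similarity, and the helper names `doc_vector`, `cosine_similarity` and `mean_reciprocal_rank` are all hypothetical. The toy two-dimensional embeddings are made up for the example and are not data from the paper.

```python
import numpy as np


def doc_vector(tokens, embeddings):
    """Average the embeddings of the tokens that have one.

    Averaging is an assumed aggregation, not necessarily the paper's choice.
    """
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        raise ValueError("no token of the document has an embedding")
    return np.mean(vecs, axis=0)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def mean_reciprocal_rank(similarity):
    """MRR for a square score matrix.

    similarity[i, j] scores query i against candidate j, and the true
    match of query i is assumed to be candidate i.
    """
    sim = np.asarray(similarity, dtype=float)
    correct = np.diag(sim)
    # Rank of the true match: 1 + number of candidates scoring strictly higher.
    ranks = 1 + (sim > correct[:, None]).sum(axis=1)
    return float(np.mean(1.0 / ranks))


# Toy shared embedding space (hypothetical values, English and French words).
toy_embeddings = {
    "cat": np.array([1.0, 0.0]),
    "chat": np.array([0.9, 0.1]),
    "bank": np.array([0.0, 1.0]),
    "banque": np.array([0.1, 0.9]),
}
en_docs = [["cat"], ["bank"]]
fr_docs = [["chat"], ["banque"]]

# Pairwise similarities between every English and every French document.
sim = np.array([
    [cosine_similarity(doc_vector(e, toy_embeddings),
                       doc_vector(f, toy_embeddings)) for f in fr_docs]
    for e in en_docs
])
mrr = mean_reciprocal_rank(sim)
```

In a setup like this, the diagonal entries of `sim` would play the role of the per-pair similarity that controls how tightly the two topic distributions are bound, while the full matrix could serve as the score table for the cross-lingual retrieval evaluation.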
