Cross-lingual topic models are widely used for cross-lingual text analysis because they reveal aligned latent topics across languages. However, most existing methods suffer from two issues: they produce repetitive topics that hinder further analysis, and their performance declines when the given bilingual dictionary has low coverage. In this paper, we propose Cross-lingual Topic Modeling with Mutual Information (InfoCTM). Instead of the direct alignment used in previous work, we propose a mutual information based topic alignment method. It acts as a regularizer that properly aligns topics and prevents degenerate topic representations of words, which mitigates the repetitive-topic issue. To address the low-coverage dictionary issue, we further propose a cross-lingual vocabulary linking method that finds additional linked cross-lingual words for topic alignment beyond the translations in a given dictionary. Extensive experiments on English, Chinese, and Japanese datasets demonstrate that our method outperforms state-of-the-art baselines, producing more coherent, diverse, and well-aligned topics and transferring better to cross-lingual classification tasks.
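The mutual information based alignment can be pictured as a contrastive objective over linked cross-lingual word pairs. Below is a minimal PyTorch sketch, not the paper's exact objective: it assumes topic-word representations `phi_src` and `phi_tgt` and a tensor of linked word-index pairs, and it uses an InfoNCE-style lower bound on mutual information in which a word's dictionary link is the positive and all other target-language words are negatives; the function name and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def mi_alignment_loss(phi_src, phi_tgt, linked_pairs, temperature=0.1):
    """InfoNCE-style lower bound on the mutual information between the
    topic representations of linked cross-lingual word pairs (a sketch).

    phi_src: (V_src, K) topic-word representations, source language
    phi_tgt: (V_tgt, K) topic-word representations, target language
    linked_pairs: (P, 2) LongTensor of (source index, target index) links
    """
    i, j = linked_pairs[:, 0], linked_pairs[:, 1]
    anchors = F.normalize(phi_src[i], dim=-1)      # (P, K) source anchors
    targets = F.normalize(phi_tgt, dim=-1)         # (V_tgt, K) all candidates
    logits = anchors @ targets.t() / temperature   # (P, V_tgt) similarities
    # Maximizing the InfoNCE bound is equivalent to cross-entropy with the
    # linked target word as the correct class among the target vocabulary.
    return F.cross_entropy(logits, j)
```

Treating the full target vocabulary as negatives is what discourages degenerate word representations: a word can only score well against its own link by remaining distinguishable from every other word.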
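The vocabulary linking step can likewise be sketched. The mutual-nearest-neighbour criterion with a cosine threshold below is an illustrative stand-in for the paper's linking rule, assuming pretrained cross-lingually aligned word embeddings `emb_src` and `emb_tgt` and the dictionary pairs `dict_pairs`; the threshold value is an assumption.

```python
import torch
import torch.nn.functional as F

def link_vocabularies(emb_src, emb_tgt, dict_pairs, threshold=0.6):
    """Augment a low-coverage dictionary with extra cross-lingual word
    links mined from pretrained aligned embeddings (a sketch).

    emb_src: (V_src, d), emb_tgt: (V_tgt, d) aligned word embeddings
    dict_pairs: (D, 2) LongTensor of dictionary translation pairs
    """
    src = F.normalize(emb_src, dim=-1)
    tgt = F.normalize(emb_tgt, dim=-1)
    sim = src @ tgt.t()                            # (V_src, V_tgt) cosines
    best_tgt = sim.argmax(dim=1)                   # nearest target per source
    best_src = sim.argmax(dim=0)                   # nearest source per target
    src_idx = torch.arange(src.size(0))
    # Keep confident mutual nearest neighbours as new linked pairs.
    keep = (best_src[best_tgt] == src_idx) & (sim[src_idx, best_tgt] >= threshold)
    extra = torch.stack([src_idx[keep], best_tgt[keep]], dim=1)
    # Union of the dictionary and the newly mined links, deduplicated.
    return torch.unique(torch.cat([dict_pairs, extra], dim=0), dim=0)
```

The mined links then feed the same alignment objective as the dictionary translations, which is how a low-coverage dictionary stops being the bottleneck.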