Recently, Neural Topic Models (NTMs) inspired by variational autoencoders have attracted increasing research interest due to their promising results on text analysis. However, it is usually hard for existing NTMs to achieve good document representation and coherent/diverse topics at the same time. Moreover, their performance often degrades severely on short documents. The requirement of reparameterisation can also compromise their training quality and model flexibility. To address these shortcomings, we present a new neural topic model based on the theory of optimal transport (OT). Specifically, we propose to learn the topic distribution of a document by directly minimising its OT distance to the document's word distribution. Importantly, the cost matrix of the OT distance models the weights between topics and words, and is constructed from the distances between topic and word embeddings in a shared embedding space. Our proposed model can be trained efficiently with a differentiable loss. Extensive experiments show that our framework significantly outperforms state-of-the-art NTMs in discovering more coherent and diverse topics and deriving better document representations, for both regular and short texts.
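The core idea above (an OT distance between a document's topic distribution and its word distribution, with a cost matrix built from topic-word embedding distances) can be sketched with a standard entropic-regularised Sinkhorn solver. This is an illustrative toy, not the authors' implementation: the embeddings are random stand-ins, and all names, shapes, and the regularisation value are assumptions.

```python
import numpy as np

def sinkhorn_ot(a, b, M, reg=0.1, n_iters=200):
    """Entropic-regularised OT cost between distributions a (over topics)
    and b (over words), with cost matrix M (topic-word distances)."""
    K = np.exp(-M / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):          # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # transport plan with marginals ~a, ~b
    return float(np.sum(P * M))       # OT cost <P, M>

rng = np.random.default_rng(0)
n_topics, vocab_size, dim = 5, 20, 8

# Cost matrix from distances between topic and word embeddings
# (random placeholders here; the model would learn these embeddings).
topic_emb = rng.normal(size=(n_topics, dim))
word_emb = rng.normal(size=(vocab_size, dim))
M = np.linalg.norm(topic_emb[:, None] - word_emb[None, :], axis=-1)
M /= M.max()                          # normalise costs to [0, 1]

doc_topics = rng.dirichlet(np.ones(n_topics))    # document's topic distribution
doc_words = rng.dirichlet(np.ones(vocab_size))   # document's word distribution

loss = sinkhorn_ot(doc_topics, doc_words, M)     # differentiable OT loss
print(loss)
```

In the full model, `doc_topics` would come from an encoder network and the loss would be minimised by gradient descent through the Sinkhorn iterations, avoiding the reparameterisation trick mentioned in the abstract.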