Pre-trained language models have set a new state of the art in many NLP tasks. For topic modeling, however, statistical generative models such as LDA remain prevalent, and these do not easily allow incorporating contextual word vectors. As a result, they may yield topics that align poorly with human judgment. In this work, we propose a novel topic modeling and inference algorithm. We suggest a bag of sentences (BoS) approach that uses sentences as the unit of analysis. We leverage pre-trained sentence embeddings by combining generative process models with clustering. We derive a fast inference algorithm based on expectation maximization, hard assignments, and an annealing process. Our evaluation shows that our method yields state-of-the-art results with relatively low computational demands. Our method is also more flexible than prior work leveraging word embeddings, since it allows topic-document distributions to be customized using priors. Code is at \url{https://github.com/JohnTailor/BertSenClu}.
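To make the combination of expectation maximization, hard assignments, and annealing more concrete, the following is a minimal sketch of that general idea applied to sentence vectors. It is not the authors' algorithm or the code in the linked repository. The function name `hard_em_topics`, the scoring rule (cosine similarity plus a log of smoothed document-topic counts), the prior parameter `alpha`, and the use of Gumbel noise with a linearly decaying temperature `t0` as the annealing step are all assumptions made for illustration. The input `emb` stands in for pre-trained sentence embeddings, which would have to be computed separately.

```python
import numpy as np

def hard_em_topics(emb, doc_ids, n_topics, alpha=1.0, n_iter=20, t0=1.0, seed=0):
    """Illustrative hard-assignment EM over sentence vectors (assumed design).

    emb:     (n_sentences, dim) array of unit-normalised sentence vectors.
    doc_ids: (n_sentences,) integer array, the document index of each sentence.
    Returns the topic index per sentence and the topic vectors.
    """
    rng = np.random.default_rng(seed)
    n_docs = int(doc_ids.max()) + 1
    # Start from a random hard assignment of sentences to topics.
    assign = rng.integers(n_topics, size=len(emb))
    for it in range(n_iter):
        # M-step: each topic vector is the normalised sum of its sentences.
        centers = np.zeros((n_topics, emb.shape[1]))
        np.add.at(centers, assign, emb)
        # Re-seed topics that currently own no sentence with a random sentence.
        empty = np.linalg.norm(centers, axis=1) == 0
        centers[empty] = emb[rng.integers(len(emb), size=empty.sum())]
        centers /= np.linalg.norm(centers, axis=1, keepdims=True)
        # Document-topic counts derived from the current hard assignments.
        counts = np.zeros((n_docs, n_topics))
        np.add.at(counts, (doc_ids, assign), 1.0)
        # E-step (hard): similarity to the topic plus the log of the
        # prior-smoothed share of that topic in the sentence's document.
        scores = emb @ centers.T + np.log(counts[doc_ids] + alpha)
        # Annealing (assumed form): noise that decays linearly to zero.
        temp = t0 * (1.0 - it / max(n_iter - 1, 1))
        scores += temp * rng.gumbel(size=scores.shape)
        assign = scores.argmax(axis=1)
    return assign, centers
```

In this sketch, a larger `alpha` flattens the influence of the document-topic counts, which loosely mirrors the idea of steering topic-document distributions through a prior. The real method may define and use its priors differently.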