Abundant sequential documents, such as online archives, social media, and news feeds, are updated in a streaming fashion, where each chunk of documents carries smoothly evolving yet dependent topics. Such digital texts have attracted extensive research on dynamic topic modeling to infer hidden evolving topics and their temporal dependencies. However, most existing approaches focus on single-topic-thread evolution and ignore the fact that a current topic may be coupled with multiple relevant prior topics. In addition, these approaches face an intractable inference problem when inferring latent parameters, resulting in high computational cost and performance degradation. In this work, we assume that a current topic evolves from all prior topics with corresponding coupling weights, forming a multi-topic-thread evolution. Our method models the dependencies between evolving topics and thoroughly encodes their complex multi-couplings across time steps. To overcome the intractable inference challenge, we propose a new solution with a set of novel data augmentation techniques, which successfully decomposes the multi-couplings between evolving topics. A fully conjugate model is thus obtained, which guarantees the effectiveness and efficiency of the inference technique. A novel Gibbs sampler with a backward-forward filter algorithm efficiently learns the latent time-evolving parameters in closed form. In addition, the latent Indian Buffet Process (IBP) compound distribution is exploited to automatically infer the overall number of topics and to customize the sparse topic proportions for each sequential document without bias. The proposed method is evaluated on both synthetic and real-world datasets against competitive baselines, demonstrating its superiority in terms of lower per-word perplexity, more coherent topics, and better document time prediction.
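As a rough illustration of the multi-topic-thread assumption, the following minimal sketch draws each current topic from a Dirichlet distribution whose mean is a weighted mixture of all prior topics. It is not the model proposed in this work: the Dirichlet form, the function names `coupled_prior_mean` and `evolve_topics`, the `concentration` parameter, and the toy coupling matrix are all assumptions made for illustration only.

```python
import numpy as np

def coupled_prior_mean(prev_topics, coupling):
    """Mix all prior topics into one prior mean per current topic.

    prev_topics: (K, V) array, each row a word distribution at time t-1.
    coupling:    (K, K) non-negative array; row k holds the (hypothetical)
                 coupling weights of current topic k on every prior topic.
    """
    # Normalise each row so the weights form a convex combination.
    weights = coupling / coupling.sum(axis=1, keepdims=True)
    # Row k is a weighted average of all prior topics, so it sums to 1.
    return weights @ prev_topics

def evolve_topics(prev_topics, coupling, concentration=50.0, rng=None):
    """Sample current topics around the coupled prior means (assumed Dirichlet)."""
    rng = np.random.default_rng() if rng is None else rng
    mean = coupled_prior_mean(prev_topics, coupling)
    # A small constant keeps every Dirichlet parameter strictly positive.
    return np.vstack([rng.dirichlet(concentration * m + 1e-8) for m in mean])

# Toy example: 3 topics over a 6-word vocabulary.
rng = np.random.default_rng(0)
prev = rng.dirichlet(np.ones(6), size=3)
coupling = np.array([
    [0.8, 0.1, 0.1],   # topic 0 mostly follows its own thread
    [0.2, 0.7, 0.1],   # topic 1 also borrows from topic 0
    [0.0, 0.3, 0.7],   # topic 2 is coupled with topic 1
])
mean = coupled_prior_mean(prev, coupling)
curr = evolve_topics(prev, coupling, rng=rng)
```

Setting `coupling` to the identity matrix recovers the single-topic-thread case, in which each current topic depends only on its own predecessor.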