Stick-Breaking Embedded Topic Model with Continuous Optimal Transport for Online Analysis of Document Streams

Online topic models are unsupervised algorithms that identify latent topics in data streams that evolve continuously over time. Although these methods align naturally with real-world scenarios, they have received considerably less attention from the community than their offline counterparts, owing to specific additional challenges. To tackle these issues, we present SB-SETM, a model that extends the Embedded Topic Model (ETM) to process data streams by merging models trained on successive partial document batches. To this end, SB-SETM (i) leverages a truncated stick-breaking construction for the per-document topic distribution, enabling the model to automatically infer from the data the appropriate number of active topics at each timestep; and (ii) introduces a merging strategy for topic embeddings based on a continuous formulation of optimal transport adapted to the high dimensionality of the latent topic space. Numerical experiments show that SB-SETM outperforms baselines on simulated scenarios. We also test it extensively on a real-world corpus of news articles covering the Russian-Ukrainian war throughout 2022-2023.
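To make the truncated stick-breaking construction in (i) concrete, here is a minimal NumPy sketch of the generic technique, not the SB-SETM implementation. It maps K-1 stick fractions to K topic weights that sum to one. The function names, the Beta(1, 2) draw used to generate the fractions, and the 0.01 threshold for counting a topic as "active" are all illustrative assumptions; the abstract does not specify how the model parameterizes the fractions or decides which topics are active.

```python
import numpy as np


def truncated_stick_breaking(v):
    """Map K-1 stick fractions in (0, 1) to K topic weights summing to 1.

    weight_k = v_k * prod_{j<k} (1 - v_j) for k < K, and the last weight
    takes whatever is left of the stick (this is the truncation step).
    """
    v = np.asarray(v, dtype=float)
    # Length of stick remaining before each break: [1, (1-v_1), (1-v_1)(1-v_2), ...]
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)))
    weights = np.empty(len(v) + 1)
    weights[:-1] = v * remaining[:-1]
    weights[-1] = remaining[-1]
    return weights


def count_active_topics(weights, threshold=0.01):
    """Hypothetical helper: count topics whose weight exceeds a threshold."""
    return int(np.sum(np.asarray(weights) > threshold))


# Illustrative usage: the Beta(1, 2) fractions are an assumption of this sketch.
rng = np.random.default_rng(0)
fractions = rng.beta(1.0, 2.0, size=9)       # K - 1 = 9 breaks
theta = truncated_stick_breaking(fractions)  # K = 10 topic weights
n_active = count_active_topics(theta)
```

Because each break takes a fraction of the stick that is left, later topics tend to receive geometrically smaller weights, which is why a fixed truncation level K can still leave the effective number of topics to be determined by the data.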
