Topic modeling enables exploration and compact representation of a corpus. The CaringBridge (CB) dataset is a massive collection of journals written by patients and caregivers during a health crisis. Topic modeling on the CB dataset is challenging, however, because many authors write about their health journeys asynchronously. To overcome this challenge, we introduce the Dynamic Author-Persona topic model (DAP), a probabilistic graphical model designed for temporal corpora with multiple authors. The novelty of the DAP model lies in its representation of authors by a persona, where a persona captures the propensity to write about certain topics over time. Further, we present a regularized variational inference algorithm, which we use to encourage the DAP model's personas to be distinct. Our results show significant improvements over competing topic models, particularly after regularization, and highlight the DAP model's unique ability to capture common journeys shared by different authors.