The textual content of a document and its publication date are intertwined. For example, the publication of a news article on a topic is influenced by previous publications on similar issues, according to underlying temporal dynamics. However, it can be challenging to retrieve meaningful information when the text itself carries little information. Furthermore, the textual content of a document is not always correlated with its temporal dynamics. We develop the Powered Dirichlet-Hawkes process (PDHP), a method that clusters textual documents according to both their content and their publication time. PDHP yields significantly better results than state-of-the-art models when either the temporal information or the textual content is weakly informative. PDHP also relaxes the assumption that textual content and temporal dynamics are perfectly correlated. We demonstrate that PDHP generalizes previous work, such as DHP and UP. Finally, we illustrate a possible application using a real-world dataset from Reddit.
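To make the idea of a temporal cluster prior more concrete, the following minimal sketch shows one way such a prior could be written. It is an illustration under stated assumptions, not the model specified in this abstract: we assume each existing cluster has a Hawkes-style intensity with an exponential kernel, that this intensity is raised to an exponent `r` before normalization, and that a new cluster receives a constant weight `alpha`. The names `exp_kernel_intensity`, `powered_cluster_prior`, `r`, `alpha`, and `decay` are hypothetical.

```python
import math


def exp_kernel_intensity(t, past_times, decay=1.0):
    """Hawkes-style intensity at time t: sum of exponentially decayed past events.

    Assumption: an exponential kernel; the abstract does not specify one.
    """
    return sum(math.exp(-decay * (t - s)) for s in past_times if s < t)


def powered_cluster_prior(t, clusters, r=1.0, alpha=1.0, decay=1.0):
    """Prior over cluster assignments for a document published at time t.

    clusters: list of lists, each holding the publication times of one cluster.
    Returns one probability per existing cluster, plus a final entry for
    opening a new cluster. The exponent r is an assumed tuning knob.
    """
    weights = [exp_kernel_intensity(t, times, decay) ** r for times in clusters]
    weights.append(alpha)  # constant weight for a new cluster (assumption)
    total = sum(weights)
    return [w / total for w in weights]


# Two existing clusters; a new document arrives at t = 1.0.
history = [[0.0, 0.5], [0.9]]
uniform = powered_cluster_prior(1.0, history, r=0.0)
linear = powered_cluster_prior(1.0, history, r=1.0)
sharp = powered_cluster_prior(1.0, history, r=5.0)
```

In this sketch, `r = 0` gives every existing cluster the same weight regardless of its history, `r = 1` weights clusters in proportion to their intensity, and larger values of `r` concentrate the prior on the most active cluster. A full clustering model would combine such a prior with a likelihood over the textual content.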