The Latent Dirichlet Allocation (LDA) model is a popular method for creating mixed-membership clusters. Although originally developed for text analysis, LDA has been used in a wide range of other applications. We propose a new formulation of the LDA model that incorporates covariates. In this model, a negative binomial regression is embedded within LDA, which enables straightforward interpretation of the regression coefficients and analysis of the quantity of cluster-specific elements in each sampling unit (rather than modeling the proportion of each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling algorithm to estimate the model parameters. Using simulations, we show that our algorithm successfully retrieves the true parameter values and can predict the abundance matrix from the information given by the covariates. The model is illustrated using real data sets from three different areas: text mining of coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama). This model allows the identification of mixed-membership clusters in discrete data and provides inference on the relationship between covariates and the abundance of these clusters.
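To make the idea of embedding a negative binomial regression within LDA concrete, the following is a minimal simulation sketch of one way such data could be generated. It is an illustration under stated assumptions, not the model specification of the paper: the log link, the single shared dispersion parameter, the Dirichlet draw for the cluster profiles, and all names (`simulate_lda_covariates`, `beta`, `phi`, `dispersion`) are hypothetical choices made here. The sketch covers data generation only and does not implement the slice-within-Gibbs sampler.

```python
import numpy as np


def simulate_lda_covariates(n_units=50, n_clusters=3, n_categories=20,
                            n_covariates=2, dispersion=5.0, seed=0):
    """Simulate an abundance matrix from an assumed LDA-with-covariates model."""
    rng = np.random.default_rng(seed)

    # Design matrix with an intercept column (assumed layout).
    X = np.column_stack([np.ones(n_units),
                         rng.normal(size=(n_units, n_covariates))])

    # One column of regression coefficients per cluster (hypothetical values).
    beta = rng.normal(scale=0.5, size=(n_covariates + 1, n_clusters))
    beta[0] += 2.0  # raise the intercepts so the counts are not mostly zero

    # Assumed log link: mean number of cluster-k elements in each unit.
    mu = np.exp(X @ beta)

    # NumPy's negative binomial takes (n, p) and has mean n * (1 - p) / p,
    # so p = r / (r + mu) gives mean mu with dispersion r.
    p = dispersion / (dispersion + mu)
    cluster_counts = rng.negative_binomial(dispersion, p)

    # Cluster profiles over categories (words, products, species).
    phi = rng.dirichlet(np.full(n_categories, 0.5), size=n_clusters)

    # Distribute each unit's cluster-specific elements among the categories.
    abundance = np.zeros((n_units, n_categories), dtype=np.int64)
    for unit in range(n_units):
        for k in range(n_clusters):
            abundance[unit] += rng.multinomial(cluster_counts[unit, k], phi[k])

    # Expected abundance given the covariates, usable as a prediction.
    expected_abundance = mu @ phi
    return X, beta, cluster_counts, phi, abundance, expected_abundance


if __name__ == "__main__":
    X, beta, counts, phi, abundance, expected = simulate_lda_covariates()
    print(abundance.shape, expected.shape)
```

Under these assumptions each coefficient acts on the log of the expected number of cluster-specific elements in a sampling unit, so a one-unit increase in a covariate multiplies that expected count by the exponential of the coefficient. This is the sense in which the counts, rather than the cluster proportions, are the object of the regression.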