Analysis of short text, such as social media posts, is extremely difficult because topic models rely on observing many document-level word co-occurrences. Beyond estimating topic distributions, a common downstream task is grouping the authors of these documents for subsequent analyses. Traditional approaches estimate the document groupings first and then identify user clusters in a separate, independent procedure. We propose a novel model that extends Latent Dirichlet Allocation by modeling strong dependence among the words in the same document, governed by user-level topic distributions. We also cluster users simultaneously, removing the need for post-hoc cluster estimation and improving topic estimation by shrinking noisy user-level topic distributions towards typical values. Our method performs as well as, or better than, traditional approaches to problems arising in short text, and we demonstrate its usefulness on a dataset of tweets from United States Senators, recovering both meaningful topics and clusters that reflect partisan ideology.
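The proposed model itself is not specified in this abstract, but the baseline it improves on can be illustrated. A common workaround for sparse word co-occurrence in short text is to pool each author's posts into a single document and fit vanilla LDA, yielding user-level topic distributions. The sketch below is a minimal collapsed Gibbs sampler for standard LDA under that pooling scheme; the function name and hyperparameter defaults are illustrative, not taken from the paper.

```python
import random


def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for vanilla LDA.

    docs: list of token lists -- here, one pooled document per user.
    Returns (theta, vocab, nkw): per-document topic proportions,
    the sorted vocabulary, and topic-word counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}

    # Count tables: doc-topic, topic-word, and topic totals.
    ndk = [[0] * K for _ in docs]
    nkw = [[0] * V for _ in range(K)]
    nk = [0] * K

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(K)
            zs.append(t)
            ndk[d][t] += 1
            nkw[t][widx[w]] += 1
            nk[t] += 1
        z.append(zs)

    # Gibbs sweeps: remove a token's assignment, resample it from the
    # collapsed conditional, and add it back.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                wi = widx[w]
                ndk[d][t] -= 1; nkw[t][wi] -= 1; nk[t] -= 1
                weights = [
                    (ndk[d][k] + alpha) * (nkw[k][wi] + beta) / (nk[k] + V * beta)
                    for k in range(K)
                ]
                t = rng.choices(range(K), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][wi] += 1; nk[t] += 1

    # Smoothed per-document (per-user) topic proportions.
    theta = [
        [(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
        for d in range(len(docs))
    ]
    return theta, vocab, nkw
```

Under author pooling, each entry of `theta` is a user-level topic distribution; the abstract's model instead estimates these jointly with user clusters, shrinking them towards cluster-typical values rather than treating each user independently.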