Analysis of short texts, such as social media posts, is extremely difficult because of their inherent brevity. In addition to classifying the topics of such posts, a common downstream task is grouping their authors for subsequent analyses. We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words within the same document and by introducing user-level topic distributions. We also cluster users simultaneously, removing the need for post-hoc cluster estimation and improving topic estimation by shrinking noisy user-level topic distributions toward typical values. Our method performs as well as, or better than, traditional approaches, and we demonstrate its usefulness on a dataset of tweets from United States Senators, recovering both meaningful topics and clusters that reflect partisan ideology. We also develop a novel measure of echo chambers among these politicians by characterizing the insularity of topics discussed by groups of Senators, and we provide uncertainty quantification for this measure.
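For concreteness, the display below gives a minimal sketch of one plausible generative process consistent with this description. The single-topic-per-post reading of "strong dependence among the words in the same document," the cluster-level Dirichlet means, and all symbols ($\pi$, $m_c$, $\alpha$, $\beta$, $\theta_u$, $\phi_k$, $z_{ud}$, $w_{udn}$) are our own illustrative assumptions, not the authors' exact specification.

\begin{align*}
c_u &\sim \mathrm{Categorical}(\pi) && \text{cluster assignment for user } u,\\
\theta_u \mid c_u = c &\sim \mathrm{Dirichlet}(\alpha\, m_c) && \text{user-level topic distribution, shrunk toward cluster mean } m_c,\\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution for topic } k,\\
z_{ud} \mid \theta_u &\sim \mathrm{Categorical}(\theta_u) && \text{a single topic for post } d \text{ of user } u,\\
w_{udn} \mid z_{ud} &\sim \mathrm{Categorical}(\phi_{z_{ud}}) && \text{every word in the post shares that topic.}
\end{align*}

Under this reading, letting all words of a post share one topic $z_{ud}$ induces the strong within-document dependence described above, while the cluster-level means $m_c$ both define the user clusters and provide the shrinkage of noisy user-level topic distributions toward typical values.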