There is an escalating need for methods to identify latent patterns in text data from many domains. We introduce a new method to identify topics in a corpus and represent documents as topic sequences. Discourse Atom Topic Modeling draws on advances in theoretical machine learning to integrate topic modeling and word embedding, capitalizing on the distinct capabilities of each. We first identify a set of vectors ("discourse atoms") that provide a sparse representation of an embedding space. Atom vectors can be interpreted as latent topics: Through a generative model, atoms map onto distributions over words; one can also infer the topic that generated a sequence of words. We illustrate our method with a prominent example of underutilized text: the U.S. National Violent Death Reporting System (NVDRS). The NVDRS summarizes violent death incidents with structured variables and unstructured narratives. We identify 225 latent topics in the narratives (e.g., preparation for death and physical aggression); many of these topics are not captured by existing structured variables. Motivated by known patterns in suicide and homicide by gender, and recent research on gender biases in semantic space, we identify the gender bias of our topics (e.g., a topic about pain medication is feminine). We then compare the gender bias of topics to their prevalence in narratives of female versus male victims. Results provide a detailed quantitative picture of reporting about lethal violence and its gendered nature. Our method offers a flexible and broadly applicable approach to model topics in text data.
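As a minimal sketch of the two directions described above (an atom mapping onto a distribution over words, and inferring which atom generated a sequence of words), the toy example below assumes a log-linear form in which a word's probability is proportional to the exponential of its dot product with the atom vector. The abstract does not specify the generative model, so this form, the two-dimensional embeddings, the vocabulary, and the function names `atom_word_distribution` and `infer_atom` are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def atom_word_distribution(atom, word_vectors):
    """Map one atom onto a distribution over words.

    Assumption: p(word | atom) is proportional to exp(<word_vector, atom>).
    """
    scores = word_vectors @ atom
    scores = scores - scores.max()  # shift for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def infer_atom(context_vectors, atoms):
    """Return the index of the atom closest (by cosine) to a context window.

    Assumption: the window is summarized by the mean of its word vectors.
    """
    context = context_vectors.mean(axis=0)
    context = context / np.linalg.norm(context)
    unit_atoms = atoms / np.linalg.norm(atoms, axis=1, keepdims=True)
    return int(np.argmax(unit_atoms @ context))

# Hypothetical toy data: four words in a 2-d embedding space, two atoms.
vocab = ["pills", "overdose", "punch", "fight"]
word_vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
])
atoms = np.array([
    [1.0, 0.0],  # toy "medication" atom
    [0.0, 1.0],  # toy "physical aggression" atom
])

medication_probs = atom_word_distribution(atoms[0], word_vectors)
top_word = vocab[int(np.argmax(medication_probs))]

# Infer the atom behind the window ["punch", "fight"].
inferred = infer_atom(word_vectors[2:], atoms)
```

In practice the atoms would come from a sparse dictionary learned over trained word embeddings rather than being written by hand; the hand-set vectors here only keep the example self-contained.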