Topic models analyze text from a set of documents. Documents are modeled as a mixture of topics, with topics defined as probability distributions on words. Inferences of interest include the most probable topics and characterization of a topic by inspecting the topic's highest probability words. Motivated by a data set of web pages (documents) nested in web sites, we extend the Poisson factor analysis topic model to hierarchical topic presence models for analyzing text from documents nested in known groups. We incorporate an unknown binary topic presence parameter for each topic at the web site and/or the web page level to allow web sites and/or web pages to be sparse mixtures of topics and we propose logistic regression modeling of topic presence conditional on web site covariates. We introduce local topics into the Poisson factor analysis framework, where each web site has a local topic not found in other web sites. Two data augmentation methods, the Chinese table distribution and Pólya-Gamma augmentation, aid in constructing our sampler. We analyze text from web pages nested in United States local public health department web sites to abstract topical information and understand national patterns in topic presence.
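To make the generative structure described above concrete, the following is a minimal simulation sketch, not the authors' specification. It assumes gamma-distributed page-level topic loadings, Dirichlet-distributed topics, a single hypothetical site covariate, and topic presence modeled only at the web site level; the function name, hyperparameter values, and dimensions are all illustrative assumptions. No inference (the sampler with its data augmentation steps) is attempted here.

```python
import numpy as np


def simulate_topic_presence(n_sites=4, pages_per_site=5, n_topics=3,
                            vocab_size=20, seed=0):
    """Simulate word counts for pages nested in sites (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Global topics: each row is a probability distribution over words.
    phi = rng.dirichlet(np.full(vocab_size, 0.1), size=n_topics)
    # One local topic per site, used only by that site's pages.
    phi_local = rng.dirichlet(np.full(vocab_size, 0.1), size=n_sites)

    # Site covariates: an intercept plus one hypothetical covariate.
    X = np.column_stack([np.ones(n_sites), rng.normal(size=n_sites)])
    beta = rng.normal(size=(2, n_topics))  # assumed regression coefficients

    # Logistic regression for the probability that a topic is present at a site.
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    presence = rng.binomial(1, prob)  # binary indicators, shape (sites, topics)

    # Map each page to the site that contains it.
    site_of_page = np.repeat(np.arange(n_sites), pages_per_site)
    n_pages = site_of_page.size

    # Page-level loadings, set to zero for topics absent from the page's site.
    theta = rng.gamma(shape=1.0, scale=5.0, size=(n_pages, n_topics))
    theta *= presence[site_of_page]
    theta_local = rng.gamma(shape=1.0, scale=5.0, size=n_pages)

    # Poisson factor analysis: counts are Poisson with rate equal to the
    # loadings-weighted sum of global topics plus the site's local topic.
    rate = theta @ phi + theta_local[:, None] * phi_local[site_of_page]
    counts = rng.poisson(rate)
    return counts, presence, theta, site_of_page
```

In this sketch a site whose presence indicator for a topic is zero contributes no words from that topic on any of its pages, which is one way to read "sparse mixtures of topics"; a page-level indicator could be added in the same manner.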