Topic models provide a useful text-mining tool for learning, extracting, and discovering latent structures in large text corpora. Although a plethora of methods have been proposed for topic modeling, lacking in the literature is a formal theoretical investigation of the statistical identifiability and accuracy of latent topic estimation. In this paper, we propose a maximum likelihood estimator (MLE) of latent topics based on a specific integrated likelihood that is naturally connected to the concept, in computational geometry, of volume minimization. Our theory introduces a new set of geometric conditions for topic model identifiability, conditions that are weaker than conventional separability conditions, which typically rely on the existence of pure topic documents or of anchor words. Weaker conditions allow a wider and thus potentially more fruitful investigation. We conduct finite-sample error analysis for the proposed estimator and discuss connections between our results and those of previous investigations. We conclude with empirical studies employing both simulated and real datasets.
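To make the volume-minimization connection concrete, the following is a minimal sketch and not the estimator proposed in the paper. It assumes a toy vocabulary of three words, three topics, and noise-free documents, and it reads "volume minimization" as preferring the smallest-volume topic simplex that still encloses the document word-frequency vectors. The helper names `simplex_volume` and `contains`, and the arrays `tight`, `loose`, and `docs`, are hypothetical and chosen only for illustration.

```python
import math
import numpy as np


def simplex_volume(vertices):
    """Volume of the (K-1)-simplex whose K vertices are the rows of `vertices`."""
    v = np.asarray(vertices, dtype=float)
    edges = v[1:] - v[0]               # K-1 edge vectors from the first vertex
    gram = edges @ edges.T             # Gram matrix of the edge vectors
    det = max(float(np.linalg.det(gram)), 0.0)  # guard against tiny negative round-off
    return math.sqrt(det) / math.factorial(edges.shape[0])


def contains(vertices, points, tol=1e-9):
    """True if every row of `points` is a convex combination of the rows of `vertices`."""
    v = np.asarray(vertices, dtype=float)
    p = np.asarray(points, dtype=float)
    # Solve for weights w with v.T @ w = p.T and sum(w) = 1 (least squares).
    a = np.vstack([v.T, np.ones(len(v))])
    b = np.vstack([p.T, np.ones(len(p))])
    w, *_ = np.linalg.lstsq(a, b, rcond=None)
    exact = np.allclose(a @ w, b, atol=1e-8)   # the affine system is actually solved
    return bool(exact and (w >= -tol).all())   # and all weights are nonnegative


# Assumed toy setup: each row is a topic, i.e. a distribution over three words.
tight = np.array([[0.6, 0.2, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.2, 0.2, 0.6]])
loose = np.eye(3)  # anchor-word-like candidate: each topic uses a single word

# Noise-free documents: fixed mixtures of the `tight` topics.
weights = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1 / 3, 1 / 3, 1 / 3],
                    [0.5, 0.5, 0.0]])
docs = weights @ tight

for name, topics in [("tight", tight), ("loose", loose)]:
    print(name, contains(topics, docs), simplex_volume(topics))
```

Both candidates enclose every document, so enclosure alone does not single out the topics; the volume criterion separates them by favoring the smaller simplex. In this toy setup no document is a pure single-word document, so the `loose` candidate is feasible but not tight.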