We propose a new method of estimation in topic models that is not a variation on the existing simplex-finding algorithms and that estimates the number of topics K from the observed data. We derive new finite-sample minimax lower bounds for the estimation of the word-topic matrix A, as well as new upper bounds for our proposed estimator. We describe the scenarios in which our estimator is minimax adaptive. Our finite-sample analysis is valid for any number of documents (n), individual document length (N_i), dictionary size (p), and number of topics (K); both p and K are allowed to increase with n, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. The simulations illustrate that the new algorithm is faster and more accurate than the current ones, even though it starts at a computational and theoretical disadvantage: it does not know the correct number of topics K, whereas the competing methods are given the correct value.