The probabilistic topic model imposes a low-rank structure on the expectation of the corpus matrix. Therefore, singular value decomposition (SVD) is a natural tool for dimension reduction. We propose an SVD-based method for estimating a topic model. Our method constructs an estimate of the topic matrix from only a few leading singular vectors of the corpus matrix, and offers a great advantage in memory use and computational cost for large-scale corpora. The core ideas behind our method include a pre-SVD normalization to tackle severe word-frequency heterogeneity, a post-SVD normalization to create a low-dimensional word embedding that manifests a simplex geometry, and a post-SVD procedure to construct an estimate of the topic matrix directly from the embedded word cloud. We provide the explicit rate of convergence of our method. We show that our method attains the optimal rate for long and moderately long documents, and that it improves the rates of existing methods for short documents. The key to our analysis is a sharp row-wise large-deviation bound for empirical singular vectors, which is technically demanding to derive and potentially useful for other problems. We apply our method to a corpus of Associated Press news articles and a corpus of abstracts of statistical papers.
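To make the pipeline concrete, the following is a minimal NumPy sketch of the three-step recipe described above (pre-SVD normalization, SVD, post-SVD normalization, and simplex fitting). It is an illustration under simplifying assumptions, not the paper's exact algorithm: the function name estimate_topic_matrix, the k-means stand-in for vertex hunting, and the barycentric-coordinate reconstruction are hypothetical choices filled in for readability.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def estimate_topic_matrix(D, K):
    """Hypothetical sketch of an SVD-based topic estimator.

    D : (p, n) word-document frequency matrix (each column sums to 1)
    K : number of topics
    Returns an estimated (p, K) topic matrix with columns summing to 1.
    """
    p, n = D.shape

    # Pre-SVD normalization: divide each row by the square root of the
    # word's average frequency to tame word-frequency heterogeneity.
    m = np.maximum(D.mean(axis=1), 1e-12)
    Xi, _, _ = np.linalg.svd(D / np.sqrt(m)[:, None], full_matrices=False)
    Xi = Xi[:, :K]  # keep only the K leading left singular vectors

    # Fix the sign so the leading singular vector is (mostly) nonnegative.
    if Xi[:, 0].sum() < 0:
        Xi *= -1.0

    # Post-SVD normalization: entrywise ratios against the first singular
    # vector give a (K-1)-dim word embedding lying near a simplex.
    R = Xi[:, 1:] / Xi[:, [0]]

    # Vertex hunting: a crude k-means stand-in for locating the K vertices
    # of the simplex; the paper's procedure is more careful about noise.
    V, _ = kmeans2(R, K, minit="++", seed=0)

    # Barycentric coordinates of each embedded word w.r.t. the vertices.
    B = np.vstack([V.T, np.ones((1, K))])                 # (K, K) system
    W = np.linalg.solve(B, np.vstack([R.T, np.ones((1, p))])).T
    W = np.clip(W, 0.0, None)

    # Undo both normalizations and rescale columns to unit sums.
    A = np.sqrt(m)[:, None] * Xi[:, [0]] * W
    A = np.clip(A, 0.0, None)
    return A / A.sum(axis=0, keepdims=True)
```

Because only K singular vectors of the corpus matrix are needed, a sketch like this can use truncated (rather than full) SVD routines on large corpora, which is the source of the memory and computational savings claimed above.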