Latent Dirichlet Allocation (LDA) is a topic model widely used in natural language processing and machine learning. Most approaches to training the model rely on iterative algorithms, which makes it difficult to run LDA on big corpora that are best analyzed in parallel and distributed computational environments. Indeed, current approaches to parallel inference either do not converge to the correct posterior or require storage of large dense matrices in memory. We present a novel sampler that overcomes both problems, and we show that this sampler is faster, both empirically and theoretically, than previous Gibbs samplers for LDA. We do so by employing a novel P\'olya-urn-based approximation in the sparse partially collapsed sampler for LDA. We prove that the approximation error vanishes with data size, making our algorithm asymptotically exact, a property of importance for large-scale topic models. In addition, we show, via an explicit example, that -- contrary to popular belief in the topic modeling literature -- partially collapsed samplers can be more efficient than fully collapsed samplers. We conclude by comparing the performance of our algorithm with that of other approaches on well-known corpora.
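To make the idea concrete, the following is a minimal, illustrative sketch of the kind of P\'olya-urn-style approximation referred to above; it is not the paper's implementation. The exact partially collapsed step draws each topic's word distribution from a Dirichlet, which is dense, whereas the approximate step here replaces that draw with normalized Poisson counts, most of which are exactly zero. The Poisson-normalization form, the function names, and the toy data are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_topic_word(word_topic_counts, beta):
    """Exact partially collapsed step: draw each topic's word
    distribution phi_k ~ Dirichlet(n_k + beta). The result is dense."""
    return np.array([rng.dirichlet(n_k + beta) for n_k in word_topic_counts])

def polya_urn_topic_word(word_topic_counts, beta):
    """Assumed Polya-urn-style approximation (illustration only):
    replace the Dirichlet draw with elementwise Poisson counts and
    normalize. Most entries are exactly zero, so phi_k can be stored
    sparsely, and the error shrinks as the counts n_k grow."""
    counts = rng.poisson(word_topic_counts + beta).astype(float)
    counts += counts.sum(axis=1, keepdims=True) == 0  # guard: avoid 0/0 rows
    return counts / counts.sum(axis=1, keepdims=True)

# Toy example: 3 topics, a vocabulary of 5 words, symmetric prior beta = 0.01.
n = rng.integers(0, 50, size=(3, 5)).astype(float)
print("nonzero entries, Dirichlet draw:", np.count_nonzero(dirichlet_topic_word(n, 0.01)))
print("nonzero entries, Polya-urn draw:", np.count_nonzero(polya_urn_topic_word(n, 0.01)))
```

In this sketch, any word with a zero count in a topic is drawn as an exact zero with high probability when beta is small, which is the sparsity that makes it possible to avoid storing large dense matrices in memory.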