Across many data domains, co-occurrence statistics about the joint appearance of objects are powerfully informative. By transforming unsupervised learning problems into decompositions of co-occurrence statistics, spectral algorithms provide transparent and efficient approaches to posterior inference tasks such as latent topic analysis and community detection. As object vocabularies grow, however, it rapidly becomes more expensive to store co-occurrence statistics and to run inference algorithms on them. Rectifying co-occurrence, the key step for upholding model assumptions, becomes increasingly vital in the presence of rare terms, but current techniques cannot scale to large vocabularies. We propose novel methods that simultaneously compress and rectify co-occurrence statistics, scaling gracefully with the size of the vocabulary and the dimension of the latent space. We also present new algorithms that learn latent variables from the compressed statistics, and verify that our methods perform comparably to previous approaches on both textual and non-textual data.
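To make the terms above concrete, the following is a minimal sketch, not the method proposed here. It assumes a bag-of-words input and uses a generic alternating-projection rectification (low-rank positive semidefinite, entries summing to one, non-negative) followed by a rank-k factor as the compressed form. The function names `cooccurrence`, `rectify`, and `compress` are hypothetical. Note that this sketch stores the dense vocabulary-by-vocabulary matrix, which is exactly the cost that grows with vocabulary size.

```python
import numpy as np

def cooccurrence(docs, vocab_size):
    """Estimate a joint co-occurrence matrix from documents given as lists of word ids."""
    C = np.zeros((vocab_size, vocab_size))
    for doc in docs:
        counts = np.bincount(doc, minlength=vocab_size).astype(float)
        n = counts.sum()
        if n < 2:
            continue  # a document needs at least two tokens to form a pair
        # Count ordered pairs of distinct token positions, normalized per document.
        C += (np.outer(counts, counts) - np.diag(counts)) / (n * (n - 1))
    return C / C.sum()

def rectify(C, k, iters=50):
    """Alternate projections: rank-k PSD, entries summing to one, non-negative (assumed scheme)."""
    for _ in range(iters):
        w, V = np.linalg.eigh((C + C.T) / 2)
        top = np.argsort(w)[-k:]                     # indices of the k largest eigenvalues
        w_top = np.clip(w[top], 0, None)             # drop negative eigenvalues
        C = (V[:, top] * w_top) @ V[:, top].T        # rank-k PSD reconstruction
        C = C + (1.0 - C.sum()) / C.size             # shift so entries sum to one
        C = np.clip(C, 0, None)                      # enforce non-negativity
    return C

def compress(C, k):
    """Return a vocab_size-by-k factor Y with Y @ Y.T approximating C."""
    w, V = np.linalg.eigh((C + C.T) / 2)
    top = np.argsort(w)[-k:]
    return V[:, top] * np.sqrt(np.clip(w[top], 0, None))

docs = [[0, 1, 1, 2], [2, 3, 3, 0], [1, 2, 3, 3]]   # toy corpus, 4-word vocabulary
C = cooccurrence(docs, vocab_size=4)
R = rectify(C, k=2)
Y = compress(R, k=2)
```

In this sketch the factor `Y` needs storage proportional to vocabulary size times k rather than vocabulary size squared, which illustrates why working directly with compressed statistics is attractive for large vocabularies.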