Markov chain Monte Carlo (MCMC) methods are often used in clustering since they guarantee asymptotically exact expectations in the infinite-time limit. In finite time, though, slow mixing often leads to poor performance. Modern computing environments offer massive parallelism, but naive implementations of parallel MCMC can exhibit substantial bias. In MCMC samplers of continuous random variables, Markov chain couplings can overcome bias. But these approaches depend crucially on paired chains meeting after a small number of transitions. We show that chains coupled through straightforward applications of existing coupling ideas to discrete clustering variables fail to meet quickly. This failure arises from the "label-switching problem": semantically equivalent cluster relabelings impede fast meeting of coupled chains. We instead consider chains as exploring the space of partitions rather than partitions' (arbitrary) labelings. Using a metric on the partition space, we formulate a practical algorithm using optimal transport couplings. Our theory confirms that our method is accurate and efficient. In experiments ranging from clustering of genes or seeds to graph colorings, we show the benefits of our coupling in the highly parallel, time-limited regime.
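To make the label-switching problem concrete, the following minimal sketch shows two label vectors that differ entry by entry yet describe the same partition, and a distance defined on partitions that is unaffected by relabeling. The function names `to_partition` and `pair_disagreement` are hypothetical, and the pair-counting distance is only one possible choice assumed here for illustration; the abstract does not specify which metric the method uses.

```python
from itertools import combinations


def to_partition(labels):
    """Convert a label vector into a label-free partition (a set of blocks)."""
    blocks = {}
    for item, label in enumerate(labels):
        blocks.setdefault(label, set()).add(item)
    return frozenset(frozenset(block) for block in blocks.values())


def pair_disagreement(labels_a, labels_b):
    """Count item pairs co-clustered in one labeling but not in the other.

    Assumption: this pair-counting distance is an illustrative stand-in
    for "a metric on the partition space", not necessarily the one used
    by the method described above.
    """
    n = len(labels_a)
    return sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in combinations(range(n), 2)
    )


x = [0, 0, 1, 1, 2]
y = [2, 2, 0, 0, 1]  # same clusters as x, with the labels permuted
z = [0, 0, 0, 1, 2]  # item 2 has moved to a different cluster

print(x == y)                              # label vectors differ
print(to_partition(x) == to_partition(y))  # partitions coincide
print(pair_disagreement(x, y), pair_disagreement(x, z))
```

A coupling that compares raw label vectors would treat `x` and `y` as far apart even though the two chains already agree on the clustering, whereas a distance computed on partitions treats them as having met.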