In mixture modeling and clustering applications, the number of components or clusters is often unknown. A stick-breaking mixture model, such as the Dirichlet process mixture model, is an appealing construction: it assumes infinitely many components while shrinking the weights of most unused components to near zero. However, it is well known that this shrinkage is inadequate: even when the component distribution is correctly specified, spurious weights appear and yield an inconsistent estimate of the number of clusters. In this article, we propose a simple solution: when breaking each mixture weight stick into two pieces, the length of the second piece is multiplied by a quasi-Bernoulli random variable that takes the value one or a small constant close to zero. This effectively creates a soft truncation and further shrinks the unused weights. Asymptotically, we show that as long as this small constant diminishes to zero fast enough, the posterior distribution of the number of clusters converges to the true number of clusters. For comparison, we rigorously examine Dirichlet process mixture models whose concentration parameter is either constant or rapidly diminishing to zero; both choices lead to inconsistency for the number of clusters. Our proposed model is easy to implement, requiring only a small modification of a standard Gibbs sampler for mixture models. Empirically, the proposed method exhibits superior performance in simulations and in a data application on clustering brain networks.
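To make the construction concrete, the following is a minimal sketch of drawing a finite number of weights from a stick-breaking scheme with a quasi-Bernoulli multiplier, not the authors' implementation. Several details are assumptions that the abstract does not specify: the Beta(1, alpha) break proportions, the probability `p` that the multiplier equals one, the small constant `eps`, the finite number of sticks, and the choice to let the first piece absorb the length removed from the second piece so that the weights still sum to (nearly) one.

```python
import numpy as np

def quasi_bernoulli_stick_breaking(num_sticks, alpha=1.0, p=0.9, eps=1e-3, rng=None):
    """Draw mixture weights from a soft-truncated stick-breaking scheme (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    # Proportion of the remaining stick assigned to the second piece
    # under ordinary stick breaking (assumed Beta(1, alpha) breaks).
    second = 1.0 - rng.beta(1.0, alpha, size=num_sticks)
    # Quasi-Bernoulli multiplier: 1 with probability p, otherwise eps.
    b = np.where(rng.random(num_sticks) < p, 1.0, eps)
    # Shrink the second piece; the first piece takes what is left (assumption).
    second = b * second
    first = 1.0 - second
    # Stick length remaining before each break: 1, s_1, s_1 * s_2, ...
    remaining = np.concatenate(([1.0], np.cumprod(second)[:-1]))
    weights = remaining * first
    return weights, b

weights, b = quasi_bernoulli_stick_breaking(20, rng=np.random.default_rng(0))
print(np.round(weights, 4))
```

Whenever the multiplier equals `eps`, almost all of the remaining stick goes to the current component, so every later weight is scaled by a factor of at most `eps`. This is the soft truncation described above: later components keep positive but negligible weight.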