In mixture modeling and clustering applications, the number of components and clusters is often not known. A stick-breaking mixture model, such as the Dirichlet process mixture model, is an appealing construction that assumes infinitely many components while shrinking the weights of most of the unused components to near zero. However, it is well known that this shrinkage is inadequate: even when the component distribution is correctly specified, spurious weights appear and give an inconsistent estimate of the number of clusters. In this article, we propose a simple solution: when breaking each mixture weight stick into two pieces, the length of the second piece is multiplied by a quasi-Bernoulli random variable, which takes either the value one or a small constant close to zero. This effectively creates a soft truncation and further shrinks the unused weights. Asymptotically, we show that as long as this small constant diminishes to zero at a rate faster than $o(1/n^2)$, with $n$ the sample size, the posterior distribution will converge to the true number of clusters. In comparison, we rigorously explore Dirichlet process mixture models using a concentration parameter that is either constant or rapidly diminishing to zero; both choices lead to inconsistency for the number of clusters. Our proposed model is easy to implement, requiring only a small modification of a standard Gibbs sampler for mixture models. In simulations and a data application of clustering brain networks, our proposed method recovers the ground-truth number of clusters, and leads to a small number of clusters.
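To make the construction concrete, the following is a minimal sketch of drawing mixture weights from a quasi-Bernoulli stick-breaking prior. It is an illustration under stated assumptions, not the article's implementation: the finite truncation level `K`, the `Beta(1, alpha)` stick proportions, the probability `p` that the multiplier equals one, the value of `eps`, and the final renormalization are all choices made here for the example and are not specified in the abstract.

```python
import numpy as np


def quasi_bernoulli_stick_breaking(K, alpha=1.0, p=0.9, eps=1e-6, rng=None):
    """Draw K mixture weights from a truncated quasi-Bernoulli stick-breaking prior.

    Hypothetical parameters (assumptions for illustration only):
      K     : truncation level (number of sticks)
      alpha : second shape parameter of the Beta(1, alpha) stick proportions
      p     : probability that the quasi-Bernoulli multiplier equals one
      eps   : the small constant close to zero
    """
    rng = np.random.default_rng(rng)

    # Stick proportions: the fraction of the remaining stick given to each component.
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # truncation: the last component takes whatever remains

    # Quasi-Bernoulli multipliers: one with probability p, eps otherwise.
    b = rng.choice([1.0, eps], size=K, p=[p, 1.0 - p])

    # After each break, the second piece (1 - v_l) is multiplied by b_l,
    # so the stick left for component k is prod_{l<k} b_l * (1 - v_l).
    remaining = np.concatenate(([1.0], np.cumprod(b[:-1] * (1.0 - v[:-1]))))
    w = v * remaining

    # Assumption: renormalize so the weights sum to one after the shrinkage.
    return w / w.sum(), b


weights, multipliers = quasi_bernoulli_stick_breaking(K=20, rng=0)
```

Once a multiplier equals `eps`, every later weight is scaled by that factor, which is the soft truncation described above. Setting `p=1.0` makes every multiplier equal to one and recovers an ordinary truncated stick-breaking draw.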