Clustering in high dimensions poses many statistical challenges. While traditional distance-based clustering methods are computationally feasible, they lack a probabilistic interpretation and rely on heuristics to estimate the number of clusters. On the other hand, probabilistic model-based clustering techniques often fail to scale, and devising algorithms that can effectively explore the posterior space remains an open problem. Building on recent developments in Bayesian distance-based clustering, we propose a hybrid solution that entails defining a likelihood on pairwise distances between observations. The novelty of the approach consists in including both cohesion and repulsion terms in the likelihood, which allows for cluster identifiability. This implies that clusters are composed of objects which have small "dissimilarities" among themselves (cohesion) and similar dissimilarities to observations in other clusters (repulsion). We show how this modelling strategy has interesting connections with existing proposals in the literature, as well as a decision-theoretic interpretation. The proposed method is computationally efficient and applicable to a wide variety of scenarios. We demonstrate the approach in a simulation study and in an application to digital numismatics.
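To make the idea of a likelihood on pairwise distances with cohesion and repulsion terms concrete, the following is a minimal sketch and not the specification used in the paper. It assumes, purely for illustration, that within-cluster distances follow one Gamma density (the cohesion term) and between-cluster distances follow another (the repulsion term). The function names `gamma_logpdf` and `partition_loglik`, the Gamma parameter values, and the synthetic data are all hypothetical.

```python
import math
import numpy as np


def gamma_logpdf(x, shape, rate):
    """Elementwise log density of a Gamma(shape, rate) distribution at x > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * np.log(x) - rate * x)


def partition_loglik(D, labels, cohesion=(2.0, 4.0), repulsion=(8.0, 2.0)):
    """Log-likelihood of a partition given a symmetric distance matrix D.

    Illustrative assumption: each pair (i, j) with i < j contributes a
    cohesion term if both points share a label, and a repulsion term otherwise.
    The (shape, rate) pairs are fixed here rather than estimated.
    """
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)  # each unordered pair once
    d = D[i, j]
    same = labels[i] == labels[j]
    within = gamma_logpdf(d[same], *cohesion).sum()      # cohesion term
    between = gamma_logpdf(d[~same], *repulsion).sum()   # repulsion term
    return within + between


# Synthetic example: two well-separated groups of 10 points in the plane.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(10, 2)),
    rng.normal(loc=[4.0, 0.0], scale=0.3, size=(10, 2)),
])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

true_labels = np.repeat([0, 1], 10)       # the generating partition
merged_labels = np.zeros(20, dtype=int)   # everything in one cluster
alternating_labels = np.arange(20) % 2    # a deliberately wrong split

ll_true = partition_loglik(D, true_labels)
ll_merged = partition_loglik(D, merged_labels)
ll_alternating = partition_loglik(D, alternating_labels)
print(ll_true > ll_merged, ll_true > ll_alternating)
```

Under these assumed densities, the generating partition scores higher than either alternative, because merging the groups forces large distances through the cohesion term and the alternating split forces small distances through the repulsion term. A full Bayesian treatment would also place a prior on the partition and on the density parameters, which this sketch omits.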