Clustering in high dimensions poses many statistical challenges. While traditional distance-based clustering methods are computationally feasible, they lack a probabilistic interpretation and rely on heuristics to estimate the number of clusters. On the other hand, probabilistic model-based clustering techniques often fail to scale, and devising algorithms that can effectively explore the posterior space remains an open problem. Building on recent developments in Bayesian distance-based clustering, we propose a hybrid solution that defines a likelihood on the pairwise distances between observations. The novelty of the approach lies in including both cohesion and repulsion terms in the likelihood, which allows for cluster identifiability. This implies that clusters are composed of objects that have small dissimilarities among themselves (cohesion) and similar dissimilarities to observations in other clusters (repulsion). We show how this modelling strategy has interesting connections with existing proposals in the literature, as well as a decision-theoretic interpretation. The proposed method is computationally efficient and applicable to a wide variety of scenarios. We demonstrate the approach in a simulation study and in an application to digital numismatics.