Robustly determining the optimal number of clusters in a data set is essential in a wide range of applications. Cluster enumeration becomes challenging when the true underlying structure in the observed data is corrupted by heavy-tailed noise and outliers. Recently, Bayesian cluster enumeration criteria have been derived by formulating cluster enumeration as maximization of the posterior probability of candidate models. This article generalizes robust Bayesian cluster enumeration so that it can be used with an arbitrary Real Elliptically Symmetric (RES) distributed mixture model. Our framework also covers M-estimators, which allow for mixture models that are decoupled from a specific probability distribution. Examples of Huber's and Tukey's M-estimators are discussed. We derive a robust criterion for data sets with finite sample size, and also provide an asymptotic approximation to reduce the computational cost at large sample sizes. The algorithms are applied to simulated and real-world data sets, including radar-based person identification, and show a significant improvement in robustness compared to existing methods.
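As a rough illustration of why Huber's and Tukey's M-estimators are robust to outliers, the following minimal sketch uses the standard textbook weight functions in a one-dimensional location estimate. It is not the article's RES mixture formulation or its enumeration criterion. The function names, the fixed MAD-based scale, and the tuning constants `1.345` and `4.685` (commonly used defaults for roughly 95% Gaussian efficiency) are assumptions made for this example.

```python
import statistics


def huber_weight(r, c=1.345):
    """Huber weight: 1 inside [-c, c], decaying as c/|r| outside."""
    a = abs(r)
    return 1.0 if a <= c else c / a


def tukey_weight(r, c=4.685):
    """Tukey biweight: smooth decay to exactly 0 for |r| > c."""
    if abs(r) > c:
        return 0.0
    u = r / c
    return (1.0 - u * u) ** 2


def m_location(xs, weight, iters=50):
    """Hypothetical helper: iteratively reweighted mean of 1-D data.

    Assumption: the scale is fixed once from the median absolute
    deviation (MAD) and is not re-estimated during the iterations.
    """
    mu = statistics.median(xs)
    mad = statistics.median([abs(x - mu) for x in xs])
    scale = 1.4826 * mad or 1.0  # fall back to 1.0 if the MAD is zero
    for _ in range(iters):
        w = [weight((x - mu) / scale) for x in xs]
        mu = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
    return mu


# Five inliers near 1.0 and one gross outlier.
data = [0.9, 1.0, 1.1, 1.05, 0.95, 50.0]
plain_mean = sum(data) / len(data)
huber_est = m_location(data, huber_weight)
tukey_est = m_location(data, tukey_weight)
```

On this toy data the plain mean is pulled far toward the outlier, while both weighted estimates stay close to the inliers. Huber's weight only downweights the outlier, whereas Tukey's weight removes it entirely.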