We address the problem of clustering data with general-shaped clusters under very weak parametric assumptions, using a two-step hybrid robust clustering algorithm based on trimmed k-means and hierarchical agglomeration. The algorithm has low computational complexity and effectively identifies the clusters even in the presence of data contamination. We also present natural generalizations of the approach, as well as an adaptive procedure to estimate the amount of contamination in a data-driven fashion. In our numerical simulations and real-world applications, the proposal outperforms state-of-the-art robust, model-based methods; the applications cover color quantization for image analysis, human mobility patterns based on GPS data, biomedical images of diabetic retinopathy, and functional data across weather stations.
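To make the two-step idea concrete, the following is a minimal sketch of one plausible reading of it, not the authors' implementation. It assumes that the first step over-partitions the data with trimmed k-means (many centers, with the fraction `alpha` of points farthest from their centers discarded as contamination) and that the second step merges the resulting centers by single-linkage agglomeration until the requested number of clusters remains. The function names, the choice of single linkage, and the defaults `k=20` and `alpha=0.05` are all illustrative assumptions.

```python
import numpy as np


def _assign(X, centers):
    """Return each point's nearest-center index and squared distance to it."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), d2.min(axis=1)


def trimmed_kmeans(X, k, alpha, n_iter=50, seed=0):
    """Lloyd-style trimmed k-means (illustrative sketch).

    Each iteration ignores the floor(alpha * n) points farthest from
    their nearest center when updating the centers.
    Returns (centers, labels), where trimmed points get label -1.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    n_keep = n - int(np.floor(alpha * n))
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels, dist = _assign(X, centers)
        keep = np.argsort(dist)[:n_keep]  # indices of the untrimmed points
        for j in range(k):
            members = keep[labels[keep] == j]
            if len(members) > 0:  # leave empty centers where they are
                centers[j] = X[members].mean(axis=0)
    labels, dist = _assign(X, centers)
    labels[np.argsort(dist)[n_keep:]] = -1  # mark the trimmed points
    return centers, labels


def single_linkage(points, n_groups):
    """Merge points by increasing pairwise distance until n_groups remain."""
    m = len(points)
    parent = list(range(m))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    pairs = sorted((d[i, j], i, j) for i in range(m) for j in range(i + 1, m))
    groups = m
    for _, i, j in pairs:
        if groups == n_groups:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            groups -= 1
    roots = [find(i) for i in range(m)]
    relabel = {r: g for g, r in enumerate(dict.fromkeys(roots))}
    return np.array([relabel[r] for r in roots])


def hybrid_cluster(X, n_clusters, k=20, alpha=0.05, seed=0):
    """Two-step sketch: trimmed k-means, then agglomeration of the centers.

    Returns one label per point in {0, ..., n_clusters - 1}, or -1 for
    points flagged as contamination by the trimming step.
    """
    centers, labels = trimmed_kmeans(X, k, alpha, seed=seed)
    group_of_center = single_linkage(centers, n_clusters)
    out = np.full(len(X), -1)
    kept = labels >= 0
    out[kept] = group_of_center[labels[kept]]
    return out
```

Calling `hybrid_cluster(X, n_clusters=2)` on an `(n, d)` array returns one integer label per row. In this sketch the contamination level `alpha` is fixed by the caller; the adaptive, data-driven estimate mentioned in the abstract is not reproduced here.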