Recent advances in center-based clustering continue to improve upon the drawbacks of Lloyd's celebrated $k$-means algorithm over $60$ years after its introduction. Various methods seek to address poor local minima, sensitivity to outliers, and data that are not well-suited to Euclidean measures of fit, but many are supported largely empirically. Moreover, combining such approaches in a piecemeal manner can result in ad hoc methods, and the limited theoretical results supporting each individual contribution may no longer hold. Toward addressing these issues in a principled way, this paper proposes a cohesive robust framework for center-based clustering under a general class of dissimilarity measures. In particular, we present a rigorous theoretical treatment within a Median-of-Means (MoM) estimation framework, showing that it subsumes several popular $k$-means variants. In addition to unifying existing methods, we derive uniform concentration bounds that complete their analyses, and bridge these results to the MoM framework via Dudley's chaining arguments. Importantly, we neither require any assumptions on the distribution of the outlying observations nor on the relative number of observations $n$ to features $p$. We establish strong consistency and an error rate of $O(n^{-1/2})$ under mild conditions, surpassing the best-known results in the literature. The methods are empirically validated thoroughly on real and synthetic datasets.
翻译:以中心为主的集群最近的进展在劳埃德推出60年的美元汇率算法的缺点后继续得到改善。 各种方法都试图解决当地迷你差、对出口者敏感和数据不适于欧几里德的合适衡量标准,但许多数据在很大程度上得到了经验方面的支持。 此外,以零敲碎打的方式将这类方法结合起来,可能导致采用临时方法,而支持每项贡献的有限理论结果可能不再能够维持下去。为了以原则性的方式解决这些问题,本文件提议在一般的不相容措施类别下,为以中心为主的集群建立一个具有凝聚力的框架。 特别是,我们在中米(MMM)估算框架内提出了严格的理论处理方法,表明它包含几种受欢迎的美元汇率,但在很大程度上是用经验来支持的变式。 除了统一现有方法外,我们还得出了能够完成分析的统一的集中界限,并将这些结果通过Dudley的连锁论点与MMM框架联系起来。 关键是,我们并不要求任何假设关于基于外部观察结果的分布的一致性或关于相对数字的美元/美元观察结果的精确度。