The $k$-means algorithm is one of the most classical clustering methods and has been widely and successfully used in signal processing. However, because it implicitly models each cluster with a Gaussian distribution, whose tails are thin, $k$-means performs relatively poorly on datasets containing heavy-tailed data or outliers. In addition, the standard $k$-means algorithm is relatively unstable, i.e., its results have a large variance, which reduces its credibility. In this paper, we propose a robust and stable $k$-means variant, dubbed $t$-$k$-means, together with a fast version, to alleviate these problems. Theoretically, we derive $t$-$k$-means and analyze its robustness and stability from the perspective of the loss function and the expression of the clustering center, respectively. Extensive experiments verify the effectiveness and efficiency of the proposed method. The code for reproducing the main results is available at \url{https://github.com/THUYimingLi/t-k-means}.
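To make the robustness argument concrete, the following minimal sketch contrasts the plain mean used as a $k$-means cluster center with an iteratively reweighted center that uses Student-$t$ style weights. It is only an illustration under stated assumptions: the function names, the identity scale matrix, the fixed degrees of freedom `nu`, and the toy data are all hypothetical, and the reweighting rule is the textbook EM location update for a multivariate $t$-distribution, which may differ from the update actually derived for $t$-$k$-means.

```python
import numpy as np


def mean_center(points):
    """Plain k-means style center: the arithmetic mean of the cluster."""
    return points.mean(axis=0)


def t_weighted_center(points, nu=1.0, n_iter=50):
    """Iteratively reweighted center with Student-t style weights.

    Assumptions (not taken from the paper): identity scale matrix and a
    fixed degrees-of-freedom value `nu`. Points far from the current
    center receive small weights, so outliers barely move the estimate.
    """
    d = points.shape[1]
    center = points.mean(axis=0)  # start from the non-robust mean
    for _ in range(n_iter):
        sq_dist = ((points - center) ** 2).sum(axis=1)
        weights = (nu + d) / (nu + sq_dist)
        center = (weights[:, None] * points).sum(axis=0) / weights.sum()
    return center


# Toy cluster: a symmetric 5x5 grid around the origin plus one far outlier.
xs = np.linspace(-1.0, 1.0, 5)
inliers = np.array([[x, y] for x in xs for y in xs])
outlier = np.array([[100.0, 100.0]])
data = np.vstack([inliers, outlier])

# Distance of each estimated center from the true inlier center (the origin).
mean_err = np.linalg.norm(mean_center(data))
robust_err = np.linalg.norm(t_weighted_center(data))
print(f"mean center error:       {mean_err:.3f}")
print(f"t-weighted center error: {robust_err:.3f}")
```

In this toy setting the single outlier drags the mean several units away from the origin, while the reweighted center stays close to it, which is the qualitative behavior that the heavy-tailed modeling is meant to provide.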