This paper presents a clustering technique that reduces susceptibility to data noise: it learns and clusters the data distributions and then assigns each data point to the cluster of its distribution, which lessens the impact of noise on the clustering results. The method introduces a new distance between distributions, the expectation distance (denoted ED), that goes beyond the state-of-the-art distribution distance of optimal mass transport (denoted $W_2$, for $2$-Wasserstein): the latter essentially depends only on the marginal distributions, while the former also employs information about the joint distributions. Using the ED, the paper extends classical $K$-means and $K$-medoids clustering to data distributions (rather than raw data), and it also introduces $K$-medoids using $W_2$. The paper further presents closed-form expressions of the ED for the case when the uncertainty is Gaussian. Implementation results for clustering real-world weather data with the proposed ED and with $W_2$ are also presented; this involves efficiently extracting and using the underlying uncertainty information in the form of means and variances (which, for example, suffice to characterize Gaussian distributions). The results show a striking performance improvement over classical clustering of raw data, with higher accuracy realized for ED. This is because $W_2$ employs only the marginal distributions and ignores the correlations, whereas the proposed ED also uses the joint distributions, factoring the correlations into the distance measure.
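To make the idea of clustering distributions rather than raw data concrete, the following is a minimal sketch of $K$-medoids over Gaussian summaries using $W_2$. It rests on several assumptions that are not taken from the paper: each data item is summarized by per-coordinate means and variances (a diagonal covariance), for which $W_2^2 = \lVert m_1 - m_2 \rVert^2 + \sum_i (\sigma_{1,i} - \sigma_{2,i})^2$; the medoid update is a plain alternating scheme with a naive initialization; and the function names and the toy values are hypothetical. The ED is not implemented here because its closed form is not stated in the abstract.

```python
import math


def w2_diag_gaussian(mean_a, var_a, mean_b, var_b):
    """2-Wasserstein distance between two Gaussians with diagonal covariance.

    Assumption: coordinates are independent, so each Gaussian is fully
    described by per-coordinate means and variances.
    """
    sq = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    sq += sum((math.sqrt(va) - math.sqrt(vb)) ** 2 for va, vb in zip(var_a, var_b))
    return math.sqrt(sq)


def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Alternating K-medoids on a precomputed distance matrix.

    Illustrative only: the first k items are used as initial medoids,
    which is a naive choice and not the paper's procedure.
    """
    medoids = list(range(k))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Empty cluster (e.g. duplicate medoids): keep the old medoid.
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            new_medoids.append(
                min(members, key=lambda j: sum(dist[i][j] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Hypothetical toy data: (means, variances) for four one-dimensional Gaussians.
items = [((0.0,), (1.0,)), ((0.5,), (1.0,)), ((10.0,), (4.0,)), ((10.5,), (4.0,))]
dist = [[w2_diag_gaussian(ma, va, mb, vb) for (mb, vb) in items] for (ma, va) in items]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

On this toy input the two low-mean, low-variance Gaussians end up in one cluster and the two high-mean, high-variance Gaussians in the other. Because the clustering routine only consumes a distance matrix, a different distance between distributions (such as the ED, given its closed form) could be substituted without changing `k_medoids`.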