This paper presents a clustering technique that reduces susceptibility to data noise: it learns the data distributions, clusters them, and then assigns each data item to the cluster of its distribution. The method introduces a new distance between distributions, the expectation distance (denoted ED), which goes beyond the state-of-the-art distribution distance of optimal mass transport (denoted $W_2$, for $2$-Wasserstein): the latter essentially depends only on the marginal distributions, whereas the former also uses information about the joint distributions. Using ED, the paper extends classical $K$-means and $K$-medoids clustering to operate over data distributions (rather than raw data), and it also introduces $K$-medoids using $W_2$. Closed-form expressions for the $W_2$ and ED distance measures are presented. The proposed ED and $W_2$ distance measures are applied to clustering real-world weather data and stock data, which involves efficiently extracting and using the underlying data distributions: Gaussians for weather data and lognormals for stock data. The results show a striking performance improvement over classical clustering of raw data, with ED achieving the higher accuracy. Distribution-based clustering not only offers higher accuracy but also lowers computation time, owing to its reduced time complexity.
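To make the idea of clustering distributions rather than raw data concrete, the following is a minimal sketch, not the paper's implementation. It assumes each data item has already been summarized as a Gaussian (mean and covariance) and uses the standard textbook closed form for $W_2$ between Gaussians, $W_2^2 = \|m_1 - m_2\|^2 + \mathrm{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\big)$, followed by a plain alternating $K$-medoids on the resulting distance matrix. The function names `sqrtm_psd`, `w2_gaussian`, and `k_medoids` are hypothetical, and the ED distance is not shown because its definition is not given in this abstract.

```python
import numpy as np


def sqrtm_psd(a):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T


def w2_gaussian(m1, s1, m2, s2):
    """2-Wasserstein distance between N(m1, s1) and N(m2, s2).

    Standard closed form for Gaussians; an assumption here, not
    necessarily the exact expression used in the paper.
    """
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    r1 = sqrtm_psd(s1)
    cross = sqrtm_psd(r1 @ s2 @ r1)
    sq = np.sum((m1 - m2) ** 2) + np.trace(s1 + s2 - 2.0 * cross)
    return float(np.sqrt(max(sq, 0.0)))


def k_medoids(dist, k, n_iter=100, seed=0):
    """Plain alternating K-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every item to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid: member with the smallest total distance
                # to the other members of its cluster.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new[c] = members[np.argmin(within)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels
```

A typical use would be to fit one Gaussian per data item, fill a symmetric matrix `dist[i, j] = w2_gaussian(...)` over all pairs, and pass it to `k_medoids`. Because the clustering then runs over a handful of distribution parameters per item instead of all raw samples, the pairwise distance step is cheap, which is in the spirit of the reduced time complexity the abstract reports.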