This paper presents noise-robust clustering techniques for unsupervised machine learning. Uncertainty about noise, consistency, and other ambiguities can become a severe obstacle in data analytics. As a result, data quality, cleansing, management, and governance remain critical disciplines when working with Big Data. Given this complexity, it is no longer sufficient to treat data deterministically, as in the classical setting; it becomes meaningful to account for the noise distribution and its impact on data sample values. Classical clustering methods group data into "similarity classes" according to their relative distances or similarities in the underlying space. This paper addresses the problem by extending classical $K$-means and $K$-medoids clustering to data distributions (rather than the raw data). This involves measuring distances among distributions using two types of measures: optimal mass transport (also called the Wasserstein distance, denoted $W_2$) and a novel distance measure proposed in this paper, the expected value of the random variable distance (denoted ED). The proposed distribution-based $K$-means and $K$-medoids algorithms cluster the data distributions first and then assign each raw data point to the cluster of its distribution.
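To make the pipeline concrete, the sketch below implements distribution-based $K$-medoids over one-dimensional empirical distributions using the $W_2$ distance. It is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`w2_empirical`, `k_medoids`), the synthetic batches, and the PAM-style medoid update are all hypothetical, the sketch is restricted to 1-D data with equal batch sizes, and the ED measure is omitted since its definition is introduced later in the paper.

```python
import numpy as np

def w2_empirical(x, y):
    """1-D W2 between two empirical distributions with equal sample sizes:
    the root-mean-square gap between sorted samples (quantile functions)."""
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))

def k_medoids(dists, k, n_iter=100, seed=0):
    """Plain K-medoids (PAM-style alternation) on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    n = dists.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # assign each distribution to its nearest medoid
        labels = np.argmin(dists[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size:
                # the medoid is the member minimizing total within-cluster distance
                within = dists[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dists[:, medoids], axis=1)
    return labels, medoids

# Hypothetical setup: each object to cluster is a batch of noisy samples,
# i.e. an empirical distribution rather than a single deterministic point.
rng = np.random.default_rng(1)
batches = [rng.normal(mu, 1.0, size=200) for mu in (0, 0, 5, 5, 10, 10)]
n = len(batches)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = w2_empirical(batches[i], batches[j])

labels, medoids = k_medoids(D, k=3)
print(labels)  # each raw sample then inherits the label of its distribution
```

The choice of sorted samples in `w2_empirical` relies on the fact that, in one dimension, $W_2$ reduces to the $L^2$ distance between quantile functions, which for equal-size samples is simply the RMS difference between order statistics.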