An appropriate distance metric is crucial for categorical data clustering, because distances between categorical values cannot be computed directly. However, the distances between attribute values usually vary across clusters, since the values are distributed differently in each cluster. Existing metrics do not take this into account, which leads to unreasonable distance measurements. We therefore propose a cluster-customized distance metric for categorical data clustering that competitively updates distances based on the distribution of attributes in each cluster. In addition, we extend the proposed metric to mixed data containing both numerical and categorical attributes. Experiments demonstrate the efficacy of the proposed method, which achieves an average rank close to first across fourteen datasets. The source code is available at https://anonymous.4open.science/r/CADM-47D8
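To make the motivation concrete, the following minimal sketch shows how the distance between the same pair of attribute values can differ from one cluster to another. It is not the proposed metric: the helper names `value_frequencies` and `cluster_value_distance` are hypothetical, and the distance used here (the absolute difference of within-cluster relative frequencies) is an assumption chosen only for illustration.

```python
from collections import Counter


def value_frequencies(rows, labels, attr):
    """Relative frequency of each value of attribute `attr` within each cluster."""
    counts = {}
    for row, k in zip(rows, labels):
        counts.setdefault(k, Counter())[row[attr]] += 1
    return {
        k: {v: n / sum(c.values()) for v, n in c.items()}
        for k, c in counts.items()
    }


def cluster_value_distance(freqs, k, a, b):
    """Illustrative (assumed) distance between values a and b as seen from cluster k."""
    p = freqs[k]
    return abs(p.get(a, 0.0) - p.get(b, 0.0))


# Toy data: one categorical attribute, two clusters with different distributions.
rows = [("red",), ("red",), ("red",), ("blue",),
        ("red",), ("blue",), ("blue",), ("green",)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

freqs = value_frequencies(rows, labels, attr=0)
d0 = cluster_value_distance(freqs, 0, "red", "blue")  # distance seen from cluster 0
d1 = cluster_value_distance(freqs, 1, "red", "blue")  # distance seen from cluster 1
print(d0, d1)
```

In this toy data the pair ("red", "blue") is farther apart from the viewpoint of cluster 0 than from that of cluster 1, which is the kind of cluster dependence that a single global distance cannot express.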