Cluster analysis refers to a wide range of data-analytic techniques for class discovery and is popular in many application fields. To judge the quality of a clustering result, different cluster validation procedures have been proposed in the literature. While there is extensive work on classical validation techniques, such as internal and external validation, less attention has been given to validating and replicating a clustering result using a validation dataset. Such a dataset may be a part of the original dataset that is set aside before the analysis begins, or it may be an independently collected dataset. We present a systematic, structured framework for validating clustering results on validation data that encompasses most existing validation approaches. In particular, we review classical validation techniques such as internal and external validation, stability analysis, hypothesis testing, and visual validation, and show how they can be interpreted in terms of our framework. We precisely define and formalise different types of validation of clustering results on a validation dataset and explain how each type can be implemented in practice. Furthermore, we give examples of how clustering studies from the applied literature that used a validation dataset can be classified within the framework.