Data analysis is indispensable for value creation in industry. In this context, cluster analysis can explore datasets with little or no prior knowledge and identify previously unknown patterns. This capability becomes even more important as the complexity of (big) data grows along the dimensions of volume, variety, and velocity. Many tools for cluster analysis have been developed over the years, and the variety of clustering algorithms is huge. Because the choice of clustering procedure is crucial to the results of the analysis, users need support in extracting knowledge from raw data. The objective of this paper is therefore to identify a systematic selection logic for clustering algorithms and the corresponding validation concepts. The goal is to enable potential users to choose the algorithm that best fits their needs and the properties of their underlying clustering problem, and to support them in selecting suitable validation concepts for interpreting the clustering results. Based on a comprehensive literature review, this paper provides assessment criteria for evaluating clustering methods and selecting validation concepts. The criteria are applied to several common algorithms, and algorithm selection is supported by pseudocode-based routines that take the underlying data structure into account.
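To make the notion of a validation concept concrete, the following is a minimal sketch of one widely used internal validation index, the silhouette coefficient, written in plain Python. It is an illustration only and is not taken from the paper: the function name `silhouette_score`, the toy points, and the two labelings are assumptions chosen for the example, and Euclidean distance is assumed as the dissimilarity measure.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (ranges from -1 to 1).

    Illustrative helper, not the paper's routine. Higher values indicate
    compact, well-separated clusters under Euclidean distance.
    """
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Common convention: a singleton cluster contributes 0.
            continue
        # a: mean distance to the other members of the point's own cluster
        # (dist(p, p) == 0, so dividing by len(own) - 1 excludes p itself).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:  # guard against all-identical points
            total += (b - a) / denom
    return total / len(points)


# Toy data: two tight pairs of points that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]  # groups nearby points together
bad_labels = [0, 1, 0, 1]   # groups distant points together

print(silhouette_score(points, good_labels))
print(silhouette_score(points, bad_labels))
```

On this toy data the labeling that groups nearby points scores markedly higher than the one that pairs distant points, which is the kind of signal an internal index offers when comparing the outputs of candidate algorithms. Which index suits a given problem depends on the data and the cluster shapes expected, so this should be read as one option among many.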