Hierarchical clustering is one of the standard methods taught for identifying and exploring the underlying structure that may be present within a data set. Students are shown examples in which the dendrogram, a visual representation of the hierarchical clustering, reveals a clear clustering structure. In practice, however, data analysts today frequently encounter data sets whose large scale undermines the usefulness of the dendrogram as a visualization tool: densely packed branches obscure structure, and overlapping labels are impossible to read. In this paper we present a new workflow for performing hierarchical clustering via the R package protoshiny, which aims to restore hierarchical clustering to its former role as an effective and versatile visualization tool. Our proposal combines interactivity with the ability to label internal nodes of a dendrogram with a representative data point (called a prototype). After presenting the workflow, we provide three case studies to demonstrate its utility.