We introduce an unsupervised learning approach that combines the truncated singular value decomposition with convex clustering to estimate within-cluster directions of maximum variance or covariance (in the variables) while simultaneously hierarchically clustering the observations. In contrast to previous work on joint clustering and embedding, our approach has a straightforward formulation, is readily scalable via distributed optimization, and admits a direct interpretation as hierarchically clustered principal component analysis (PCA), hierarchically clustered locally linear embedding (LLE), or hierarchically clustered canonical correlation analysis (CCA). Through numerical experiments and real-world examples relevant to precision medicine, we show that our approach outperforms traditional and contemporary clustering methods both on underdetermined problems ($p \gg N$ with tens of observations) and on large datasets (e.g., $N = 100{,}000$), while yielding interpretable dendrograms of hierarchical per-cluster principal components or canonical variates.