We introduce an unsupervised learning approach that combines the truncated singular value decomposition with convex clustering to estimate within-cluster directions of maximum variance/covariance (in the variables) while simultaneously hierarchically clustering (on observations). In contrast to previous work on joint clustering and embedding, our approach has a straightforward formulation, is readily scalable via distributed optimization, and admits a direct interpretation as hierarchically clustered principal component analysis (PCA) or hierarchically clustered canonical correlation analysis (CCA). Through numerical experiments and real-world examples relevant to precision medicine, we show that our approach outperforms traditional and contemporary clustering methods on underdetermined problems ($p \gg N$ with tens of observations) and scales to large datasets (e.g., $N=100,000$; $p=1,000$) while yielding interpretable dendrograms of hierarchical per-cluster principal components or canonical variates.
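To make "within-cluster directions of maximum variance" concrete, the following is a minimal sketch, not the joint method described above. It assumes the cluster labels are already known and applies a truncated SVD to each cluster separately, whereas the approach in the abstract estimates the clustering and the directions simultaneously. The function name `per_cluster_top_direction` and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np

def per_cluster_top_direction(X, labels):
    """Return {label: unit vector}: the leading principal direction within each cluster.

    Assumption: `labels` is given. This is plain per-cluster PCA, not the
    joint clustering-and-embedding procedure described in the abstract.
    """
    directions = {}
    for k in np.unique(labels):
        Xk = X[labels == k]
        Xk = Xk - Xk.mean(axis=0)  # center the observations within the cluster
        # Thin SVD: the first right singular vector is the direction
        # of maximum variance among the variables.
        _, _, Vt = np.linalg.svd(Xk, full_matrices=False)
        directions[int(k)] = Vt[0]
    return directions

rng = np.random.default_rng(0)
# Two synthetic clusters (p = 5 variables) with different dominant axes.
A = rng.normal(size=(20, 5)) * np.array([5.0, 0.5, 0.5, 0.5, 0.5])
B = rng.normal(size=(20, 5)) * np.array([0.5, 0.5, 5.0, 0.5, 0.5]) + 10.0
X = np.vstack([A, B])
labels = np.array([0] * 20 + [1] * 20)

dirs = per_cluster_top_direction(X, labels)
for k, v in dirs.items():
    print(k, np.round(np.abs(v), 2))
```

In this toy setting the leading direction of the first cluster should align with the first variable and that of the second cluster with the third variable, which is the kind of per-cluster structure a single global PCA would blur together.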