Clustering methods have led to a number of important discoveries in bioinformatics and beyond. A major challenge in their use is determining which clusters represent important underlying structure, as opposed to spurious sampling artifacts. This challenge is especially serious, and very few methods are available, when the data are very high-dimensional. Statistical Significance of Clustering (SigClust) is a recently developed cluster evaluation tool for high-dimensional, low-sample-size data. An important component of the SigClust approach is the definition of a single cluster as a subset of data sampled from a multivariate Gaussian distribution. Implementing SigClust requires estimating the eigenvalues of the covariance matrix of the null multivariate Gaussian distribution. We show that the original eigenvalue estimation can lead to a test that suffers from severe inflation of type I error in the important case where there are a few very large eigenvalues. This paper addresses this critical challenge using a novel likelihood-based soft-thresholding approach to estimate these eigenvalues, which leads to a much improved SigClust. Major improvements in SigClust performance are shown both by mathematical analysis, based on the new notion of the Theoretical Cluster Index, and by extensive simulation studies. Applications to some cancer genomic data further demonstrate the usefulness of these improvements.
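To make the testing idea concrete, the following is a minimal sketch of a SigClust-style Monte Carlo test, not the authors' implementation. It assumes the test statistic is the 2-means cluster index (within-cluster sum of squares divided by total sum of squares) and that the null distribution is a Gaussian whose covariance eigenvalues are supplied by the caller. The function names, the simple Lloyd-style 2-means loop, and in particular `soft_threshold_eigenvalues` with its user-supplied shift `tau` are illustrative assumptions; the abstract does not give the likelihood-based estimator, so the shrinkage rule shown here is only a stand-in.

```python
import numpy as np


def two_means_cluster_index(X, n_init=5, n_iter=50, rng=None):
    """2-means cluster index: within-cluster SS / total SS (smaller = more clustered)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    total_ss = np.sum((X - X.mean(axis=0)) ** 2)
    best_within = np.inf
    for _ in range(n_init):
        # Start each run from two distinct random observations.
        centers = X[rng.choice(n, size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():  # one cluster is empty; stop this run
                break
            new_centers = np.stack([X[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # Within-cluster SS about each cluster's own mean for the final labels.
        within = sum(
            np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
            for k in (0, 1)
            if np.any(labels == k)
        )
        best_within = min(best_within, within)
    return best_within / total_ss


def soft_threshold_eigenvalues(sample_eigs, noise_var, tau):
    """Hypothetical stand-in: shift eigenvalues down by tau, floor at noise_var."""
    return np.maximum(np.asarray(sample_eigs, dtype=float) - tau, noise_var)


def sigclust_pvalue(X, null_eigenvalues, n_sim=200, rng=None):
    """Empirical p-value: share of null cluster indices at or below the observed one."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    observed = two_means_cluster_index(X, rng=rng)
    # The cluster index is unchanged by shifts and rotations of the data,
    # so a zero-mean Gaussian with diagonal covariance is used for the null.
    scale = np.sqrt(np.asarray(null_eigenvalues, dtype=float))
    null_ci = np.array([
        two_means_cluster_index(rng.standard_normal((n, d)) * scale, rng=rng)
        for _ in range(n_sim)
    ])
    return observed, float(np.mean(null_ci <= observed))
```

In this sketch the choice of `null_eigenvalues` drives the result: overstating the large eigenvalues makes the null cluster indices smaller and the test more conservative, while understating them makes it easier to reject, which is the kind of sensitivity the abstract's concern about type I error inflation points to.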