In cancer research, clustering techniques are widely used for exploratory analyses and dimensionality reduction, and they play a critical role in identifying novel cancer subtypes, often with direct implications for patient management. As the volume of data collected by multiple research groups grows, it is increasingly feasible to investigate the replicability of clustering procedures, that is, their ability to consistently recover biologically meaningful clusters across several datasets. In this paper, we review existing methods for assessing the replicability of clustering analyses and discuss a framework for evaluating cross-study clustering replicability, which is useful when two or more studies are available. These approaches can be applied to any clustering algorithm and can employ different measures of similarity between partitions to quantify replicability, both globally (i.e., for the whole sample) and locally (i.e., for individual clusters). Using experiments on synthetic and real gene expression data, we illustrate the utility of replicability metrics for evaluating whether the same clusters are identified consistently across a collection of datasets.
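As a minimal illustration of what global and local similarity between partitions can look like, the sketch below compares two labelings of the same samples. The choice of measures is an assumption made here for illustration and is not specified in the text above: the adjusted Rand index serves as the global score, and the best-matching Jaccard overlap serves as the per-cluster score. The function names `adjusted_rand_index` and `best_jaccard_per_cluster` are hypothetical. In a cross-study setting, the two labelings could be, for instance, a clustering fitted directly on one study and the labels the same samples receive from clusters learned on another study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Global agreement between two partitions of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same samples")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no sample pairs to disagree on
    # Count sample pairs that are co-clustered in both, in A, and in B.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)  # agreement expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (both - expected) / (maximum - expected)


def best_jaccard_per_cluster(labels_a, labels_b):
    """Local score: for each cluster in A, its best Jaccard overlap in B."""
    members_a, members_b = {}, {}
    for i, (a, b) in enumerate(zip(labels_a, labels_b)):
        members_a.setdefault(a, set()).add(i)
        members_b.setdefault(b, set()).add(i)
    return {
        a: max(len(sa & sb) / len(sa | sb) for sb in members_b.values())
        for a, sa in members_a.items()
    }


# Toy labelings of eight samples (made up for illustration only).
partition_a = [0, 0, 0, 1, 1, 1, 2, 2]
partition_b = [1, 1, 1, 0, 0, 2, 2, 2]

print(adjusted_rand_index(partition_a, partition_b))
print(best_jaccard_per_cluster(partition_a, partition_b))
```

Both scores are invariant to how clusters are numbered, which matters because cluster labels carry no meaning across studies. The per-cluster score can flag a single poorly replicated cluster even when the global score is high.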