Clustering is commonly performed as an initial analysis step for uncovering structure in 'omics datasets, e.g. to discover molecular subtypes of disease. The high-throughput, high-dimensional nature of these datasets means that they provide information on a diverse array of biomolecular processes and pathways. Different groups of variables (e.g. genes or proteins) will be implicated in different biomolecular processes, so analyses limited to identifying a single clustering partition of the whole dataset are liable to conflate the multiple clustering structures that may arise from these distinct processes. To address this, we propose a multi-view Bayesian mixture model that identifies groups of variables (``views''), each of which defines a distinct clustering structure. We consider applications in stratified medicine, for which our principal goal is to identify clusters of patients that define distinct, clinically actionable disease subtypes. We adopt the semi-supervised, outcome-guided mixture modelling approach of Bayesian profile regression, which makes use of a response variable to guide inference toward the clusterings that are most relevant in a stratified medicine context. We present the model, together with illustrative simulation examples and examples from pan-cancer proteomics. We demonstrate how the approach can be used to perform integrative clustering, and consider an example in which different 'omics datasets are integrated in the context of breast cancer subtyping.