Regression-based heterogeneity analysis to identify overlapping subgroup structure in high-dimensional data

Heterogeneity is a hallmark of complex diseases. Regression-based heterogeneity analysis, which is directly concerned with outcome-feature relationships, has led to a deeper understanding of disease biology. Such an analysis identifies the underlying subgroup structure and estimates the subgroup-specific regression coefficients. However, most existing regression-based heterogeneity analyses can only address disjoint subgroups; that is, each sample is assigned to exactly one subgroup. In reality, some samples carry multiple labels: for example, many genes have several biological functions, and some cells of pure cell types transition into other types over time. This suggests that their outcome-feature relationships (regression coefficients) can be a mixture of the relationships in more than one subgroup, so disjoint subgrouping results can be unsatisfactory. To this end, we develop a novel approach to regression-based heterogeneity analysis that takes into account both possible overlaps between subgroups and high data dimensionality. A subgroup membership vector is introduced for each sample and combined with a loss function. Considering the lack of information arising from small sample sizes, an $\ell_2$ norm penalty is imposed on each membership vector to encourage similarity among its elements. A sparse penalty is also applied for regularized estimation and feature selection. Extensive simulations demonstrate the superiority of the proposed approach over direct competitors. The analysis of Cancer Cell Line Encyclopedia data and lung cancer data from The Cancer Genome Atlas shows that the proposed approach can identify an overlapping subgroup structure with favorable performance in prediction and stability.
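To make the ingredients named in the abstract concrete, the following is a minimal sketch of one way such a penalized objective could be written. The abstract does not state the exact formulation, so everything here is an assumption for illustration: the squared-error loss, the simplex-constrained membership rows, the lasso form of the sparse penalty, and the names `overlapping_objective`, `lam_sparse`, and `lam_member` are all hypothetical. Under the simplex constraint, the $\ell_2$ norm of a membership vector is smallest when its elements are equal, which is one reading of "encourage similarity in its elements".

```python
import numpy as np


def overlapping_objective(X, y, B, P, lam_sparse, lam_member):
    """Illustrative penalized objective (an assumed form, not the paper's).

    X: (n, p) feature matrix; y: (n,) outcomes.
    B: (K, p) subgroup-specific regression coefficients.
    P: (n, K) membership vectors; each row is assumed to lie on the
       probability simplex (nonnegative entries summing to one).
    """
    # Each sample's coefficients are a membership-weighted mixture of
    # the K subgroup coefficient vectors.
    sample_coef = P @ B                          # shape (n, p)
    fitted = np.sum(X * sample_coef, axis=1)     # shape (n,)

    # Assumed loss: half the mean squared error.
    loss = 0.5 * np.mean((y - fitted) ** 2)

    # Assumed sparse penalty: lasso on all subgroup coefficients.
    sparse_pen = lam_sparse * np.abs(B).sum()

    # l2 norm penalty on each membership vector; on the simplex it
    # favors vectors whose elements are close to one another.
    member_pen = lam_member * np.linalg.norm(P, axis=1).sum()

    return loss + sparse_pen + member_pen


if __name__ == "__main__":
    # Toy data: sample 0 belongs purely to subgroup 0, while sample 1
    # is an equal mixture of subgroups 0 and 1.
    X = np.array([[1.0, 0.0], [0.0, 1.0]])
    y = np.array([1.0, 1.0])
    B = np.array([[1.0, 0.0], [0.0, 2.0]])
    P = np.array([[1.0, 0.0], [0.5, 0.5]])
    print(overlapping_objective(X, y, B, P, lam_sparse=0.1, lam_member=1.0))
```

In this sketch a one-hot membership row reproduces the disjoint-subgroup case, while a row with several nonzero entries represents a sample whose regression coefficients mix more than one subgroup.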
