Revealing relationships between genes and disease phenotypes is a critical problem in biomedical studies. Disease heterogeneity makes this problem challenging: patients with what is perceived to be the same disease may form multiple subgroups, and different subgroups have distinct sets of important genes. It is hence imperative to discover the latent subgroups and reveal the subgroup-specific important genes. Several heterogeneity analysis methods have been proposed in the recent literature. Despite considerable successes, most existing studies remain limited because they cannot accommodate data contamination and they ignore the interconnections among genes. To address these shortcomings, we develop a robust structured heterogeneity analysis approach that identifies subgroups, selects important genes, and estimates their effects on the phenotype of interest. Possible data contamination is accommodated by employing the Huber loss function. A sparse overlapping group lasso penalty is imposed for regularized estimation and gene identification, while taking into account the possibly overlapping cluster structure of genes. The approach adopts an iterative strategy similar in spirit to K-means clustering. Simulations demonstrate that the proposed approach outperforms alternatives in revealing the heterogeneity and selecting important genes for each subgroup. The analysis of Cancer Cell Line Encyclopedia data leads to biologically meaningful findings with improved prediction and grouping stability.
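The K-means-like iteration with a Huber loss can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear model within each subgroup, a fixed number of subgroups `K`, and it swaps the sparse overlapping group lasso penalty for a plain L1 (lasso) penalty to keep the example short. All function names and default values (`fit_subgroup`, `heterogeneity_fit`, `delta=1.345`, `lam=0.05`) are hypothetical choices made for illustration.

```python
import numpy as np

def huber_loss(r, delta=1.345):
    """Elementwise Huber loss: quadratic for |r| <= delta, linear beyond."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def huber_grad(r, delta=1.345):
    """Derivative of the Huber loss with respect to the residual r."""
    return np.clip(r, -delta, delta)

def fit_subgroup(X, y, lam=0.05, delta=1.345, n_iter=300):
    """Proximal gradient descent for mean Huber loss plus an L1 penalty.

    Assumption: the L1 penalty is a simplified stand-in for the sparse
    overlapping group lasso penalty described in the text.
    """
    n, p = X.shape
    beta = np.zeros(p)
    # Step size 1/L, where L = ||X||_2^2 / n bounds the gradient's Lipschitz constant.
    step = n / (np.linalg.norm(X, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        grad = -X.T @ huber_grad(y - X @ beta, delta) / n
        z = beta - step * grad
        # Soft-thresholding is the proximal operator of the L1 penalty.
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

def heterogeneity_fit(X, y, K=2, lam=0.05, delta=1.345, n_rounds=20, seed=0):
    """Alternate between refitting each subgroup and reassigning samples."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(y))      # random initial subgroups
    betas = np.zeros((K, X.shape[1]))
    for _ in range(n_rounds):
        for k in range(K):
            mask = labels == k
            if mask.any():                     # skip empty subgroups
                betas[k] = fit_subgroup(X[mask], y[mask], lam, delta)
        # Assign each sample to the subgroup whose model gives the smallest loss.
        losses = huber_loss(y[:, None] - X @ betas.T, delta)
        new_labels = losses.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                              # assignments are stable
        labels = new_labels
    return labels, betas
```

Calling `heterogeneity_fit(X, y, K=2)` on an `(n, p)` design matrix and a length-`n` response returns one subgroup label per sample and one coefficient vector per subgroup. As with K-means, the result depends on the random initialization, so several restarts with different `seed` values would be a reasonable precaution.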