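Project title: Construction of Survival Prognosis Models Integrating High-Dimensional Genomic Data and Clinical Features
Project number: No.81473073
Project type: General Program
Year of approval: 2015
Discipline: Medicine and Health
Principal investigator: Wang Tong (王彤)
Institution: Shanxi Medical University (山西医科大学)
Funding: 850,000 yuan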
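Chinese abstract: Studying patient survival prognosis at the gene level has attracted considerable attention, but the high dimensionality, strong correlation, and small sample sizes of genomic data challenge traditional survival analysis methods. Studies have shown that survival prognosis models built from clinical features and high-dimensional genomic data together predict better than models built from either set of covariates alone, and that combining high-dimensional genomic data with existing bioinformatics knowledge makes models more interpretable and more useful in clinical practice. Taking the prognostic analysis of patients with diffuse large B-cell lymphoma as an example, this study proposes a modeling strategy that integrates high-dimensional genomic data with important clinical features. First, a simulation study identifies the better-performing methods among current ultra-high-dimensional variable selection methods. Next, the preferred variable selection method is combined with partial penalization to screen for prognostic genes while the important clinical variable IPI is retained in the model. The candidate genes are then grouped by function and/or information pathway into a small number of latent variables, and Cox regression containing both the latent variables and IPI is used to evaluate the goodness of fit of the alternative models. Finally, independent validation data are obtained through a prospective follow-up study, and time-dependent ROC curves are used to compare the predictive accuracy of the alternative models and to test the prognostic performance of the constructed latent variables and models.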
Chinese keywords: high-dimensional data; variable selection; latent variables; survival analysis
English abstract: Predicting the outcome of a disease or a disease-related phenotype from microarray or other high-throughput data is an important application of genomic data. One particular instance of this problem is predicting the time to a disease-specific event such as death or relapse, often referred to as survival time. Common analysis strategies are limited by the features of genomic data: high dimension, strong correlation, and small samples. Meanwhile, disease outcomes such as patients' survival durations also depend on important clinical features. It is reported that combining high-dimensional genomic data and important clinical features may lead to better prediction than can be achieved using either data source separately. In addition, results will be more interpretable and more practical in the clinic if high-dimensional genomic data are combined with prior biological knowledge. This study takes the prognostic analysis of diffuse large B-cell lymphoma patients as an example to illustrate our strategies for building survival prediction models that integrate high-dimensional gene data and clinical features. First, the better-performing ultra-high-dimensional variable selection method will be identified by a simulation study; it will then be used to select candidate genes under a partial penalty, with the important clinical feature, IPI, kept in the Cox model. Then a small number of latent variables will be constructed from candidate genes that share a similar function and/or a common information pathway, and the goodness of fit of alternative models containing IPI and the latent variables will be evaluated using latent Cox regression. Finally, the time-dependent ROC curve method will be used to determine the best model by comparing the predictive accuracy of candidate models in an independent validation dataset, and in an independent test dataset the predicted and actual survival durations will be compared to test the predictive accuracy of the final model. This strategy, based on integrating high-dimensional gene data and clinical features, will improve the predictive capability of the survival model.
English keywords: high-dimensional data; variable selection; latent variables; survival analysis
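Illustrative note: the partial-penalty step described in the abstract (screening genes while keeping IPI in the Cox model) can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual implementation: it assumes an L1 (lasso) penalty, a Breslow partial likelihood with no tied event times, a proximal-gradient solver, and simulated data in which column 0 merely stands in for IPI. The names `simulate`, `cox_loss_grad`, `soft_threshold`, and `fit_partially_penalized_cox` are hypothetical.

```python
import numpy as np


def simulate(n=200, p_genes=50, seed=0):
    """Toy data (assumption): column 0 stands in for IPI, the rest for genes."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 1 + p_genes))
    beta_true = np.zeros(1 + p_genes)
    beta_true[0] = 1.0       # clinical effect
    beta_true[1:4] = 0.8     # three truly prognostic "genes"
    t_event = rng.exponential(1.0 / np.exp(X @ beta_true))
    t_cens = rng.exponential(2.0, size=n)
    time = np.minimum(t_event, t_cens)
    event = (t_event <= t_cens).astype(float)   # 1 = event observed, 0 = censored
    return X, time, event, beta_true


def cox_loss_grad(beta, X, time, event):
    """Average negative Cox partial log-likelihood and gradient (no ties assumed)."""
    order = np.argsort(-time)            # descending time: prefix = risk set
    Xs, es = X[order], event[order]
    eta = Xs @ beta
    eta = eta - eta.max()                # shift for stability; likelihood is unchanged
    w = np.exp(eta)
    cum_w = np.cumsum(w)                             # sum of exp(eta) over each risk set
    cum_wx = np.cumsum(w[:, None] * Xs, axis=0)      # weighted covariate sums
    n = len(time)
    loss = -np.sum(es * (eta - np.log(cum_w))) / n
    grad = -np.sum(es[:, None] * (Xs - cum_wx / cum_w[:, None]), axis=0) / n
    return loss, grad


def soft_threshold(z, t):
    """Proximal operator of t * |z| (elementwise)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def fit_partially_penalized_cox(X, time, event, penalized, lam, n_iter=300):
    """Proximal gradient with backtracking; only columns flagged in
    `penalized` receive the L1 penalty, the others are left unpenalized."""
    beta = np.zeros(X.shape[1])
    step = 1.0
    for _ in range(n_iter):
        loss, grad = cox_loss_grad(beta, X, time, event)
        while True:
            z = beta - step * grad
            new = np.where(penalized, soft_threshold(z, step * lam), z)
            diff = new - beta
            new_loss, _ = cox_loss_grad(new, X, time, event)
            # Standard sufficient-decrease check for the smooth part.
            if new_loss <= loss + grad @ diff + (diff @ diff) / (2 * step) + 1e-12:
                break
            step *= 0.5
        beta = new
    return beta


if __name__ == "__main__":
    X, time, event, _ = simulate()
    penalized = np.arange(X.shape[1]) > 0    # penalize genes, not the clinical column
    beta_hat = fit_partially_penalized_cox(X, time, event, penalized, lam=0.1)
    print("clinical coefficient:", beta_hat[0])
    print("genes selected:", np.flatnonzero(beta_hat[1:]))
```

Because the clinical column is never thresholded, it stays in the model for any value of `lam`, while gene coefficients are shrunk toward zero and dropped. In practice the same effect is usually obtained through a per-variable penalty factor in an established penalized Cox package rather than a hand-written solver, and `lam` would be chosen by cross-validation.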