We propose a constrained maximum partial likelihood estimator for dimension reduction in integrative (e.g., pan-cancer) survival analysis with high-dimensional covariates. We assume that, for each population in the study, the hazard function follows a distinct Cox proportional hazards model. To borrow information across populations, we assume that all of the hazard functions depend only on a small number of linear combinations of the predictors. We estimate these linear combinations using an algorithm based on "distance-to-set" penalties, which allows us to impose both low-rankness and sparsity. We derive asymptotic results which reveal that our regression coefficient estimator is more efficient than the estimator obtained by fitting a separate proportional hazards model for each population. Numerical experiments suggest that our method outperforms related competitors under various data generating models. We use our method to perform a pan-cancer survival analysis relating protein expression to survival across 18 distinct cancer types. Our approach identifies six linear combinations, depending on only 20 proteins, which explain survival across the cancer types. Finally, we validate our fitted model on four external datasets and show that our estimated coefficients can lead to better prediction than those of popular competitors.
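To make the "distance-to-set" idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the coefficients are stacked in a p-by-J matrix `B` (one column per population, one row per predictor), that low-rankness means rank at most `r`, and that sparsity means at most `s` nonzero rows. The function names, the weight `rho`, and the choice of row-wise sparsity are all illustrative assumptions. The penalty is the squared Frobenius distance from `B` to each constraint set, computed through the corresponding projection.

```python
import numpy as np

def project_rank(B, r):
    """Closest matrix of rank <= r in Frobenius norm (truncated SVD)."""
    U, sv, Vt = np.linalg.svd(B, full_matrices=False)
    sv[r:] = 0.0                      # zero out all but the r largest singular values
    return (U * sv) @ Vt

def project_row_sparse(B, s):
    """Closest matrix with at most s nonzero rows: keep the s largest-norm rows."""
    P = np.zeros_like(B)
    order = np.argsort(np.linalg.norm(B, axis=1))   # ascending row norms
    keep = order[len(order) - s:]                   # empty when s == 0
    P[keep] = B[keep]
    return P

def distance_penalty(B, r, s, rho=1.0):
    """(rho / 2) * [dist(B, rank set)^2 + dist(B, row-sparse set)^2].

    Hypothetical penalty form, used here only for illustration.
    """
    d_rank = np.linalg.norm(B - project_rank(B, r)) ** 2
    d_sparse = np.linalg.norm(B - project_row_sparse(B, s)) ** 2
    return 0.5 * rho * (d_rank + d_sparse)

# A coefficient matrix that already satisfies both constraints:
# rank 1, with only 2 of the 5 predictors active across 3 populations.
B_feasible = np.zeros((5, 3))
B_feasible[0] = [1.0, 2.0, 3.0]
B_feasible[1] = [2.0, 4.0, 6.0]

# A full-rank matrix, which violates a rank-2 constraint.
B_infeasible = np.eye(3)
```

Under these assumptions the penalty vanishes exactly when `B` lies in both sets, so adding it to a negative log partial likelihood and increasing `rho` would push the estimate toward a low-rank, row-sparse solution. In the setting of the abstract, `r = 6` and `s = 20` would correspond to six linear combinations depending on 20 proteins.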