Risk modeling with electronic health record (EHR) data is challenging because the disease outcome is rarely observed directly and the candidate predictors are high-dimensional. In this paper, we develop a surrogate-assisted semi-supervised learning (SAS) approach to risk modeling with high-dimensional predictors. The approach leverages a large unlabeled dataset containing candidate predictors and surrogates of the outcome, together with a small labeled dataset with annotated outcomes. The SAS procedure borrows information from the surrogates and the candidate predictors to impute the unobserved outcomes through a sparse working imputation model. Moment conditions make the procedure robust to misspecification of the imputation model, and a one-step bias correction enables interval estimation for the predicted risk. We demonstrate that the SAS procedure provides valid inference for the predicted risk derived from a high-dimensional working model, even when the underlying risk prediction model is dense and the risk model is misspecified. We present an extensive simulation study demonstrating the superiority of our semi-supervised learning (SSL) approach over existing supervised methods. We apply the method to derive genetic risk prediction of type 2 diabetes mellitus using an EHR biobank cohort.
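The imputation idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator: it omits the moment conditions and the one-step bias correction, so it yields point predictions only and no valid intervals. The simulated data, the function names (`fit_l1_logistic`, `simulate`, `sas_sketch`), the penalty level `lam`, and the choice of a proximal-gradient L1-penalized logistic fit are all assumptions made for illustration.

```python
import math
import random


def expit(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(beta, x):
    """Predicted probability for one feature row; beta[0] is the intercept."""
    return expit(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


def fit_l1_logistic(rows, targets, lam, step=0.5, iters=300):
    """L1-penalized logistic fit by proximal gradient descent (ISTA).

    Targets may be fractional in [0, 1], which lets the second stage
    regress imputed probabilities instead of observed labels.
    """
    n, p = len(rows), len(rows[0])
    beta = [0.0] * (p + 1)
    for _ in range(iters):
        grad = [0.0] * (p + 1)
        for x, y in zip(rows, targets):
            resid = predict(beta, x) - y
            grad[0] += resid / n
            for j, v in enumerate(x):
                grad[j + 1] += resid * v / n
        beta[0] -= step * grad[0]  # the intercept is not penalized
        for j in range(1, p + 1):
            u = beta[j] - step * grad[j]
            # soft-thresholding: the proximal operator of the L1 penalty
            beta[j] = math.copysign(max(abs(u) - step * lam, 0.0), u)
    return beta


def simulate(n, rng, p=5):
    """Hypothetical toy data: predictors X, outcome Y, noisy surrogate S."""
    X, S, Y = [], [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        y = 1 if rng.random() < expit(-0.5 + x[0] - x[1]) else 0
        X.append(x)
        Y.append(y)
        S.append((2 * y - 1) + rng.gauss(0.0, 0.5))
    return X, S, Y


def sas_sketch(n_labeled=100, n_unlabeled=500, lam=0.01, seed=0):
    rng = random.Random(seed)
    X_lab, S_lab, Y_lab = simulate(n_labeled, rng)
    X_unl, S_unl, _ = simulate(n_unlabeled, rng)  # outcomes are discarded

    # Step 1: sparse imputation model for Y given (X, S), labeled data only.
    gamma = fit_l1_logistic([x + [s] for x, s in zip(X_lab, S_lab)], Y_lab, lam)

    # Step 2: impute outcome probabilities on the large unlabeled set.
    imputed = [predict(gamma, x + [s]) for x, s in zip(X_unl, S_unl)]

    # Step 3: sparse risk model for the imputed outcome given X alone.
    beta_risk = fit_l1_logistic(X_unl, imputed, lam)
    return beta_risk, imputed


beta_risk, imputed = sas_sketch()
print([round(b, 2) for b in beta_risk])
```

In this toy setup the surrogate enters only the imputation step, so the final risk model depends on the candidate predictors alone, matching the division of roles described above. Any real analysis would need the robustness and debiasing steps that this sketch leaves out.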