Labeling patients in electronic health records by whether they have a disease or condition, i.e., as cases or controls, increasingly relies on prediction models built from high-dimensional variables derived from structured and unstructured electronic health record data. A major hurdle is the lack of valid statistical inference methods for the case probability. In this paper, we consider high-dimensional sparse logistic regression models for prediction and propose a novel bias-corrected estimator for the case probability through the development of linearization and variance enhancement techniques. We establish asymptotic normality of the proposed estimator for any loading vector in high dimensions. We construct a confidence interval for the case probability and propose a hypothesis testing procedure for patient case-control labeling. We demonstrate the proposed method via extensive simulation studies and an application to real-world electronic health record data.