Motivation: Electronic health record (EHR) data provide a new avenue to elucidate disease comorbidities and latent phenotypes for precision medicine. To fully exploit this potential, a realistic generative process for the EHR data needs to be modelled. We present MixEHR-S to jointly infer specialist-disease topics from the EHR data. As the key contribution, we model the specialist assignments and ICD-coded diagnoses as latent topics based on each patient's underlying disease topic mixture in a novel unified supervised hierarchical Bayesian topic model. For efficient inference, we developed a closed-form collapsed variational inference algorithm to learn the model distributions of MixEHR-S. We applied MixEHR-S to two independent large-scale EHR databases in Quebec with three targeted applications: (1) congenital heart disease (CHD) diagnostic prediction among 154,775 patients; (2) chronic obstructive pulmonary disease (COPD) diagnostic prediction among 73,791 patients; (3) future insulin treatment prediction among 78,712 patients diagnosed with diabetes, as a means to assess disease exacerbation. In all three applications, MixEHR-S yielded clinically meaningful latent topics among the most predictive latent topics and achieved superior target prediction accuracy compared to existing methods, providing opportunities for prioritizing high-risk patients for healthcare services. MixEHR-S source code and scripts of the experiments are freely available at https://github.com/li-lab-mcgill/mixehrS