Electronic health records (EHRs) offer unprecedented opportunities for in-depth clinical phenotyping and prediction of clinical outcomes. Combining multiple data sources is crucial to generate a complete picture of disease prevalence, incidence and trajectories. The standard approach to combining clinical data involves collating clinical terms across different terminology systems using curated maps, which are often inaccurate and/or incomplete. Here, we propose sEHR-CE, a novel transformer-based framework that enables integrated phenotyping and analyses of heterogeneous clinical datasets without relying on these mappings. We unify clinical terminologies using textual descriptors of concepts, and represent each individual's EHR as sections of text. We then fine-tune pre-trained language models, which predict disease phenotypes more accurately than non-text and single-terminology approaches. We validate our approach using primary and secondary care data from the UK Biobank, a large-scale research study. Finally, we illustrate in a type 2 diabetes use case how sEHR-CE identifies individuals without a diagnosis who share clinical characteristics with diagnosed patients.
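As a rough illustration of the unification step described above, the sketch below replaces codes from two terminologies with their textual descriptors and joins them into one text sequence per individual. It is a minimal sketch under stated assumptions, not the sEHR-CE implementation: the `DESCRIPTORS` table, the `Event` type, the `ehr_to_text` helper, the chronological ordering and the `"; "` separator are all hypothetical choices made for illustration, and the code-to-descriptor entries are toy examples.

```python
from dataclasses import dataclass

# Hypothetical, hand-written descriptor table (illustrative entries only).
# Keys are (terminology, code); values are free-text concept descriptors.
DESCRIPTORS = {
    ("icd10", "E11"): "Type 2 diabetes mellitus",
    ("icd10", "I10"): "Essential (primary) hypertension",
    ("read_v2", "C10F."): "Type 2 diabetes mellitus",
    ("read_v2", "G20.."): "Essential hypertension",
}


@dataclass(frozen=True)
class Event:
    """One coded record from any source (hypothetical structure)."""
    terminology: str  # e.g. "icd10" (secondary care) or "read_v2" (primary care)
    code: str
    date: str         # ISO 8601 date, so string order is chronological order


def ehr_to_text(events, sep="; "):
    """Turn a list of coded events into a single text sequence.

    Codes are replaced by their descriptors, so no cross-terminology
    map is needed. Events without a known descriptor are skipped
    (an assumption of this sketch).
    """
    ordered = sorted(events, key=lambda e: e.date)
    descriptors = [
        DESCRIPTORS[(e.terminology, e.code)]
        for e in ordered
        if (e.terminology, e.code) in DESCRIPTORS
    ]
    return sep.join(descriptors)


history = [
    Event("icd10", "E11", "2015-06-12"),
    Event("read_v2", "G20..", "2010-03-01"),
]
print(ehr_to_text(history))
# prints: Essential hypertension; Type 2 diabetes mellitus
```

The resulting string could then be tokenized and passed to a pre-trained language model for fine-tuning. Because both sources end up in the same text space, a primary care code and a secondary care code for the same condition yield near-identical input without any curated mapping between them.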