Objective: For the UK Biobank, standardized phenotype codes are associated with patients who have been hospitalized but are missing for many patients who have been treated exclusively in an outpatient setting. We describe a method for phenotype recognition that imputes phenotype codes for all UK Biobank participants.

Materials and Methods: POPDx (Population-based Objective Phenotyping by Deep Extrapolation) is a bilinear machine learning framework for simultaneously estimating the probabilities of 1,538 phenotype codes. We extracted phenotypic and health-related information for 392,246 individuals from the UK Biobank for POPDx development and evaluation. A total of 12,803 ICD-10 diagnosis codes of the patients were converted to 1,538 Phecodes, which served as gold standard labels. The POPDx framework was evaluated and compared with other available methods for automated multi-phenotype recognition.

Results: POPDx can predict phenotypes that are rare or even unobserved in training. We demonstrate substantial improvement in automated multi-phenotype recognition across 22 disease categories, and show its application in identifying key epidemiological features associated with each phenotype.

Conclusions: POPDx helps provide well-defined cohorts for downstream studies. It is a general-purpose method that can be applied to other biobanks with diverse but incomplete data.
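
To make the phrase "bilinear machine learning framework" concrete, the following is a minimal sketch of a generic bilinear multi-label scorer, not the POPDx implementation. It assumes that each phenotype code has an embedding vector and that the probability of code k for a patient with feature vector x is sigmoid(x^T W e_k). The function name `bilinear_probabilities`, the matrices `W` and `E`, and the toy dimensions are all hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bilinear_probabilities(x, W, label_embeddings):
    """Return one probability per label, computed as sigmoid(x^T W e_k).

    x: patient feature vector (length d)
    W: d x m weight matrix (list of rows); hypothetical learned parameters
    label_embeddings: list of label vectors e_k, each of length m
    """
    m = len(W[0])
    # Project the patient features into the label-embedding space: h = x^T W.
    h = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(m)]
    # Score every label with a dot product against h, then squash to (0, 1).
    return [
        sigmoid(sum(hj * ej for hj, ej in zip(h, e)))
        for e in label_embeddings
    ]


# Toy sizes for illustration only; the abstract's setting has 1,538 labels.
random.seed(0)
n_features, embed_dim, n_labels = 5, 3, 4
x = [random.gauss(0, 1) for _ in range(n_features)]
W = [[random.gauss(0, 0.1) for _ in range(embed_dim)] for _ in range(n_features)]
E = [[random.gauss(0, 1) for _ in range(embed_dim)] for _ in range(n_labels)]

probs = bilinear_probabilities(x, W, E)

# A label that had no training examples can still be scored,
# provided an embedding for it is available.
unseen_label = [random.gauss(0, 1) for _ in range(embed_dim)]
unseen_prob = bilinear_probabilities(x, W, [unseen_label])[0]
```

Under these assumptions the labels enter the model only through their embeddings, so any code with an embedding can be scored. This is one plausible way a model could extend to phenotypes that are rare or unobserved in training.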