Ontology-Based and Weakly Supervised Rare Disease Phenotyping from Clinical Notes

From arXiv: 12 pages, 4 figures, submitted to IEEE Journal of Biomedical and Health Informatics, with supplementary materials (4 extra pages, 1 extra figure)

Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to identify because few cases are available for machine learning and data annotation requires domain experts. We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bidirectional Transformers (e.g. BERT). The ontology-based framework includes two steps: (i) Text-to-UMLS, which extracts phenotypes by contextually linking mentions to concepts in the Unified Medical Language System (UMLS), using a Named Entity Recognition and Linking (NER+L) tool, SemEHR, together with weak supervision based on customised rules and contextual mention representations; (ii) UMLS-to-ORDO, which matches UMLS concepts to rare diseases in the Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach learns a phenotype confirmation model that improves Text-to-UMLS linking without annotated data from domain experts. We evaluated the approach on three clinical datasets of discharge summaries and radiology reports from two institutions in the US and the UK. Our best weakly supervised method achieved 81.4% precision and 91.4% recall in extracting rare disease UMLS phenotypes from MIMIC-III discharge summaries. The overall pipeline for processing clinical notes can surface rare disease cases that are mostly uncaptured in structured data (manually assigned ICD codes). Results on radiology reports from MIMIC-III and NHS Tayside were consistent with those on the discharge summaries. We discuss the usefulness of the weak supervision approach and propose directions for future studies.
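
The two-step framework described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Mention` type, the two labelling rules (a minimum mention length and a negation check), the negation cue list, and the placeholder identifiers `CUI_A`, `ORDO_1`, etc. are hypothetical and are not the rules, tools, or identifiers used in the paper. In particular, the rule-based labels here stand in for the learned phenotype confirmation model, which in the paper is trained on contextual mention representations.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str     # surface form found in the note
    cui: str      # UMLS concept proposed by an NER+L tool (placeholder IDs here)
    context: str  # window of text surrounding the mention


# Placeholder identifiers, NOT real UMLS CUIs or ORDO IDs.
UMLS_TO_ORDO = {"CUI_A": "ORDO_1", "CUI_B": "ORDO_2"}

# Hypothetical negation cues, for illustration only.
NEGATION_CUES = ("no ", "denies ", "ruled out")


def weak_label(m: Mention) -> int:
    """Return 1 if the illustrative rules accept the mention, else 0."""
    long_enough = len(m.text) >= 4  # very short mentions are often ambiguous
    negated = any(cue in m.context.lower() for cue in NEGATION_CUES)
    return int(long_enough and not negated)


def text_to_umls(mentions):
    """Step (i): keep the UMLS concepts of mentions accepted by the rules."""
    return [m.cui for m in mentions if weak_label(m) == 1]


def umls_to_ordo(cuis):
    """Step (ii): map confirmed UMLS concepts to rare disease IDs."""
    return sorted({UMLS_TO_ORDO[c] for c in cuis if c in UMLS_TO_ORDO})


def precision_recall(predicted, gold):
    """Set-based precision and recall of predicted items against gold items."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


mentions = [
    Mention("rare syndrome X", "CUI_A", "diagnosed with rare syndrome X"),
    Mention("RSX", "CUI_B", "RSX noted"),                       # too short
    Mention("syndrome Y", "CUI_B", "no evidence of syndrome Y"),  # negated
]
confirmed = text_to_umls(mentions)
rare_diseases = umls_to_ordo(confirmed)
```

In a fuller version, the output of `weak_label` would serve as training labels for a classifier over contextual embeddings rather than as the final filter, and `precision_recall` would be computed against expert annotations held out for evaluation.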
