Rare Disease Identification from Clinical Notes with Ontologies and Weak Supervision

Identifying rare diseases from clinical notes with Natural Language Processing (NLP) is challenging because few cases are available for machine learning and because data annotation requires clinical experts. We propose a method using ontologies and weak supervision. The approach has two steps: (i) Text-to-UMLS, which links text mentions to concepts in the Unified Medical Language System (UMLS) using a named entity linking tool (e.g. SemEHR) and weak supervision based on customised rules and contextual representations from Bidirectional Encoder Representations from Transformers (BERT), and (ii) UMLS-to-ORDO, which matches UMLS concepts to rare diseases in the Orphanet Rare Disease Ontology (ORDO). Using MIMIC-III discharge summaries as a case study, we show that the Text-to-UMLS process can be greatly improved with weak supervision, without any annotated data from domain experts. Our analysis shows that the overall pipeline, applied to discharge summaries, can surface rare disease cases that are mostly not captured in the manual ICD codes of the hospital admissions.
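The two steps described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Mention` type, the `weak_label` rule (a simple mention-length check), and the identifiers in `UMLS_TO_ORDO` are all hypothetical placeholders, and the BERT-based classifier that would be trained on the weak labels is omitted, so the rule output stands in for it directly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    text: str  # surface string found in the clinical note
    cui: str   # concept ID proposed by an entity linker (placeholder values here)


# Placeholder concept-to-rare-disease mapping; a real one would be derived
# from the ontologies. These identifiers are invented for illustration.
UMLS_TO_ORDO = {"CUI_A": "ORDO_X", "CUI_B": "ORDO_Y"}


def weak_label(mention: Mention, min_len: int = 4) -> Optional[bool]:
    """Hypothetical weak-supervision rule.

    Returns False for very short mentions (likely ambiguous abbreviations),
    None to abstain on all-uppercase strings, and True otherwise.
    """
    if len(mention.text) < min_len:
        return False
    if mention.text.isupper():
        return None
    return True


def identify_rare_diseases(mentions: List[Mention]) -> List[str]:
    """Step (i): keep mentions the rule accepts. Step (ii): map to ORDO."""
    found = []
    for m in mentions:
        if weak_label(m) is True and m.cui in UMLS_TO_ORDO:
            found.append(UMLS_TO_ORDO[m.cui])
    return found


mentions = [
    Mention("HD", "CUI_A"),               # too short: rejected by the rule
    Mention("example disease", "CUI_B"),  # accepted and mapped
    Mention("common finding", "CUI_C"),   # accepted, but not a rare disease
]
result = identify_rare_diseases(mentions)
```

In a fuller pipeline, the labels produced by rules like `weak_label` would serve as training data for a contextual classifier rather than as the final decision.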

