Identifying patient cohorts from clinical notes in secondary electronic health records is a fundamental task in clinical information management, and it depends on identifying patient phenotypes. As the number of clinical notes grows, however, analyzing the data manually becomes impractical, so automatic extraction of clinical concepts is essential for identifying patient phenotypes correctly. This paper proposes a novel hybrid model that uses natural language processing and deep learning to extract patient phenotypes automatically, without dictionaries or human intervention. The proposed hybrid model combines a neural bidirectional sequence model (BiLSTM or BiGRU) with a Convolutional Neural Network (CNN) to identify patients' phenotypes in discharge reports. Furthermore, to extract more features related to each phenotype, an extra CNN layer is run in parallel with the proposed hybrid model. We used pre-trained embeddings such as FastText and Word2vec separately as the input layer to compare how different embeddings perform in identifying patient phenotypes. We also measured how applying additional data cleaning steps to the discharge reports affects phenotype identification by the deep learning models. The discharge reports come from the Medical Information Mart for Intensive Care III (MIMIC III) database. Experimental results in internal comparison demonstrate significant performance improvement over existing models. The enhanced model with the extra CNN layer obtained a relatively higher F1-score than the original hybrid model.
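For readers unfamiliar with the F1-score used to compare the models, the following is a minimal sketch of how it can be computed per phenotype and then averaged. The abstract does not state which averaging scheme was used, so the macro average here is an assumption, and the phenotype names and labels are hypothetical illustration data, not results from the paper.

```python
from typing import Dict, List


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 for one phenotype: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # convention chosen here: no true positives gives F1 = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Dict[str, List[int]], pred: Dict[str, List[int]]) -> float:
    """Unweighted mean of per-phenotype F1 (assumed averaging scheme)."""
    scores = [f1_score(gold[name], pred[name]) for name in gold]
    return sum(scores) / len(scores)


# Hypothetical labels: four discharge reports, two phenotypes (1 = present).
gold = {"phenotype_a": [1, 0, 1, 1], "phenotype_b": [0, 0, 1, 0]}
pred = {"phenotype_a": [1, 0, 0, 1], "phenotype_b": [0, 1, 1, 0]}

for name in gold:
    print(name, round(f1_score(gold[name], pred[name]), 3))
print("macro", round(macro_f1(gold, pred), 3))
```

Because F1 balances precision against recall, it is less misleading than accuracy when a phenotype is present in only a small share of the reports.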