Electronic Health Records (EHRs) have become the primary form of medical record-keeping across the United States. Federal law restricts the sharing of any EHR data that contains protected health information (PHI). De-identification, the process of identifying and removing all PHI, is therefore crucial for making EHR data publicly available for scientific research. This project explores several deep learning-based named entity recognition (NER) methods to determine which perform best on the de-identification task. We trained and tested our models on the i2b2 training dataset, and qualitatively assessed their performance using EHR data collected from a local hospital. We found that 1) BiLSTM-CRF is the best-performing encoder/decoder combination, 2) character embeddings and CRFs tend to improve precision at the cost of recall, and 3) transformers alone underperform as context encoders. Future work on structuring medical text may improve the extraction of semantic and syntactic information for EHR de-identification.
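To make the task concrete, the following is a minimal sketch of the final redaction step of NER-based de-identification, assuming a tagger has already labeled each token with a BIO tag. The function name `redact`, the example sentence, and the `PATIENT` and `DATE` labels are illustrative assumptions, not the models or label set used in this project.

```python
from typing import List, Optional


def redact(tokens: List[str], tags: List[str]) -> str:
    """Replace each BIO-tagged PHI span with a single [TYPE] placeholder.

    Hypothetical helper for illustration; the tags would come from an
    NER model such as a BiLSTM-CRF tagger.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    out: List[str] = []
    prev_type: Optional[str] = None  # entity type of the currently open span

    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Non-PHI token: keep it and close any open span.
            out.append(token)
            prev_type = None
            continue

        prefix, _, ent_type = tag.partition("-")
        # Emit a placeholder on B-, or on an I- that does not continue
        # the open span (a common tagger inconsistency).
        if prefix == "B" or ent_type != prev_type:
            out.append(f"[{ent_type}]")
        prev_type = ent_type

    return " ".join(out)


# Made-up example sentence and tags (assumed, not from any dataset).
tokens = ["Mr.", "John", "Smith", "was", "admitted", "on", "March", "3", "."]
tags = ["O", "B-PATIENT", "I-PATIENT", "O", "O", "O", "B-DATE", "I-DATE", "O"]
print(redact(tokens, tags))
```

In this framing, a missed PHI token (lower recall) leaks through to the output unchanged, while a spurious tag (lower precision) only removes a harmless word, which is why the precision/recall trade-off noted above matters for de-identification.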