Multilingual Clinical NER for Diseases and Medications Recognition in Cardiology Texts using BERT Embeddings

The rapidly growing volume of electronic health record (EHR) data underscores the need to extract biomedical knowledge from unstructured clinical text in support of data-driven clinical systems for tasks such as patient diagnosis, disease progression monitoring, treatment effect assessment, and prediction of future clinical events. While contextualized language models have substantially improved named entity recognition (NER) on English corpora, research on clinical text in low-resource languages remains scarce. To bridge this gap, we develop several deep contextual embedding models for clinical NER in the cardiology domain as part of the BioASQ MultiCardioNER shared task. We explore the effectiveness of different monolingual and multilingual BERT-based models, trained on general-domain text, for extracting disease and medication mentions from clinical case reports written in English, Spanish, and Italian. We achieved F1-scores of 77.88% on Spanish Diseases Recognition (SDR), 92.09% on Spanish Medications Recognition (SMR), 91.74% on English Medications Recognition (EMR), and 88.9% on Italian Medications Recognition (IMR). These results exceed both the mean and the median F1-scores on the test leaderboard for every subtask; the mean/median values are 69.61%/75.66% for SDR, 81.22%/90.18% for SMR, 89.2%/88.96% for EMR, and 82.8%/87.76% for IMR.
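As a quick check on the leaderboard comparison, the margins over the mean and median can be computed directly from the figures quoted in the abstract. The following is a minimal sketch; the dictionary layout and the helper name `margins_over_leaderboard` are illustrative assumptions, not part of the shared-task tooling.

```python
# F1-scores (%) copied from the abstract, per subtask:
# (reported score, leaderboard mean, leaderboard median)
reported = {
    "SDR": (77.88, 69.61, 75.66),
    "SMR": (92.09, 81.22, 90.18),
    "EMR": (91.74, 89.2, 88.96),
    "IMR": (88.9, 82.8, 87.76),
}


def margins_over_leaderboard(scores):
    """Return {subtask: (margin over mean, margin over median)} in percentage points.

    Hypothetical helper for illustration only; values are rounded to two
    decimals to avoid floating-point noise in the differences.
    """
    return {
        task: (round(ours - mean, 2), round(ours - median, 2))
        for task, (ours, mean, median) in scores.items()
    }


margins = margins_over_leaderboard(reported)
for task, (over_mean, over_median) in margins.items():
    print(f"{task}: +{over_mean:.2f} over mean, +{over_median:.2f} over median")
```

Under these figures every margin is positive. The largest gap over the mean is on SMR (10.87 points) and the smallest gap over the median is on IMR (1.14 points).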
