Textual health records of cancer patients are usually lengthy and highly unstructured, making it very time-consuming for health professionals to get a complete overview of a patient's therapeutic course. Because these limitations can lead to suboptimal or inefficient treatment procedures, healthcare providers would benefit greatly from a system that effectively summarizes the information in those records. With the advent of deep neural models, this objective has been partially attained for English clinical texts; however, the research community still lacks an effective solution for languages with limited resources. In this paper, we present the approach we developed to extract procedures, drugs, and diseases from oncology health records written in European Portuguese. This project was conducted in collaboration with the Portuguese Institute for Oncology, which, besides holding over $10$ years of duly protected medical records, also provided oncologist expertise throughout the development of the project. Since there is no annotated corpus for biomedical entity extraction in Portuguese, we also present the strategy we followed to annotate the corpus used to develop the models. The final models, which combine a neural architecture with entity linking, achieved $F_1$ scores of $88.6$, $95.0$, and $55.8$ percent in the mention extraction of procedures, drugs, and diseases, respectively.