This study addresses the de-identification of clinical reports, a critical step in making data accessible for research while preserving patient privacy. It highlights the difficulty of sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP) in implementing systematic pseudonymization of text documents from its Clinical Data Warehouse. We annotated a corpus of clinical documents according to 12 types of identifying entities and built a hybrid system that merges the results of a deep learning model with those of manual rules. Our results show an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, covering dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.
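To make the idea of a hybrid system concrete, the sketch below shows one possible way to merge entity spans from a model with spans found by hand-written rules, then replace them with placeholders. It is illustrative only: the abstract does not describe the actual merging strategy, so the overlap policy (keep the earliest span, preferring the longer one on ties), the `Span` type, the two regular expressions, and the entity labels are all assumptions, and the model output is passed in as a plain list rather than produced by a real network.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # entity type, e.g. "DATE" (hypothetical label set)
    source: str  # "model" or "rule"


# Hypothetical rules, not the ones used in the study.
RULES = {
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Collect every rule match as a labeled span."""
    return [
        Span(m.start(), m.end(), label, "rule")
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge_spans(model: List[Span], rules: List[Span]) -> List[Span]:
    """Union of both sources; overlapping spans are resolved greedily.

    Assumed policy: sort by start offset, longer span first on ties,
    and drop any span that overlaps one already kept.
    """
    candidates = sorted(model + rules, key=lambda s: (s.start, -(s.end - s.start)))
    kept: List[Span] = []
    for span in candidates:
        if not kept or span.start >= kept[-1].end:
            kept.append(span)
    return kept


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span (sorted, non-overlapping) with a <LABEL> placeholder."""
    pieces, cursor = [], 0
    for span in spans:
        pieces.append(text[cursor:span.start])
        pieces.append(f"<{span.label}>")
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

A caller would run the model and the rules on the same text, pass both span lists to `merge_spans`, and hand the result to `redact`. A real pseudonymization step would substitute plausible surrogate values rather than bare placeholders; the placeholder form is used here only to keep the sketch short.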