Unstructured textual data is at the heart of healthcare systems. For obvious privacy reasons, these documents are not accessible to researchers as long as they contain personally identifiable information. One way to share this data while respecting the legislative framework (notably GDPR or HIPAA) is to de-identify it within the medical structures, i.e. to detect a person's personal information with a Named Entity Recognition (NER) system and then replace it, so that it becomes very difficult to associate the document with that person. The challenge is to have reliable NER and substitution tools without compromising either confidentiality or the consistency of the document. Most existing research focuses on English medical documents and relies on coarse substitutions that do not benefit from advances in privacy. This paper shows how an efficient and differentially private de-identification approach can be achieved by strengthening the less robust de-identification method and by adapting state-of-the-art differentially private mechanisms for substitution purposes. The result is an approach for de-identifying clinical documents in French that is also generalizable to other languages and whose robustness is mathematically proven.
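To make the detect-then-substitute idea concrete, here is a minimal sketch of a differentially private substitution step. It is not the paper's method: it assumes the NER labels are already given, and it swaps in the standard exponential mechanism as an illustrative private selection rule. The names `exponential_mechanism`, `deidentify`, the surrogate pools, and the length-based utility are all hypothetical.

```python
import math
import random


def exponential_mechanism(candidates, utility, epsilon, sensitivity=1.0, rng=random):
    """Draw one candidate with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    scores = [utility(c) for c in candidates]
    top = max(scores)  # shift by the max score for numerical stability
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def deidentify(tokens, labels, surrogates, epsilon, rng=random):
    """Replace every token whose label is not "O" by a surrogate of the same type.

    `labels` stands in for the output of an NER system (assumed given here).
    `surrogates` maps each entity type to a hypothetical pool of replacements.
    """
    out = []
    for token, label in zip(tokens, labels):
        if label == "O":
            out.append(token)  # non-sensitive token, kept as-is
            continue

        # Hypothetical utility: prefer surrogates of similar length.
        # Values lie in [-1, 0], so a sensitivity of 1 is a valid bound.
        def utility(candidate, original=token):
            return -min(abs(len(candidate) - len(original)), 10) / 10.0

        out.append(
            exponential_mechanism(surrogates[label], utility, epsilon, 1.0, rng)
        )
    return out


if __name__ == "__main__":
    tokens = ["Mme", "Dupont", "vue", "à", "Lyon"]
    labels = ["O", "NAME", "O", "O", "CITY"]
    pools = {"NAME": ["Martin", "Bernard", "Petit"], "CITY": ["Paris", "Nice", "Lille"]}
    print(" ".join(deidentify(tokens, labels, pools, epsilon=1.0)))
```

A smaller `epsilon` makes the draw closer to uniform over the pool (more privacy, less fidelity to the original token), while a larger `epsilon` favours surrogates with higher utility. In this sketch the budget applies to each replaced token separately; how budgets compose across a whole document is left out.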