Clinical text processing has gained increasing attention in recent years. Access to sensitive patient data, however, remains a major challenge, as text cannot be shared without clearing legal hurdles and removing personal information. Many techniques exist to modify or remove patient-related information, each with different strengths. This paper investigates the influence of different anonymization techniques on the performance of ML models, using multiple datasets corresponding to five different NLP tasks. Several findings and recommendations are presented. This work confirms that stronger anonymization techniques in particular lead to a significant drop in performance. Moreover, most of the presented techniques are not secure against a re-identification attack based on similarity search.
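The similarity-search re-identification attack mentioned above can be illustrated with a minimal sketch. All data, the bag-of-words representation, and the cosine-similarity scoring below are illustrative assumptions, not the paper's actual attack setup:

```python
# Minimal sketch of re-identification via similarity search (hypothetical data).
# An attacker holding a corpus of original notes can try to link an anonymized
# note back to its source via nearest-neighbour search over text similarity.
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words vector for a text (simple whitespace tokenization)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical corpus of original clinical notes available to the attacker.
originals = [
    "patient john doe admitted with chest pain on 2020-01-05",
    "patient jane roe presented with persistent cough and fever",
]

# An "anonymized" note: identifiers masked, but most content preserved.
anonymized = "patient [NAME] admitted with chest pain on [DATE]"

# Nearest neighbour in the original corpus links the note back to its source.
scores = [cosine(bow(anonymized), bow(o)) for o in originals]
best = max(range(len(originals)), key=lambda i: scores[i])
```

Because masking identifiers leaves most of the note's content intact, the anonymized note remains far more similar to its own source (`best == 0` here) than to any other record, which is the intuition behind why weak anonymization fails this attack.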