Removing Personally Identifiable Information (PII) from clinical notes in Electronic Health Records (EHRs) is essential for research and AI development. While Large Language Models (LLMs) are powerful, their high computational costs and the data privacy risks of API-based services limit their use, especially in low-resource settings. To address this, we developed LOGICAL (Local Obfuscation by GLiNER for Impartial Context-Aware Lineage), an efficient, locally deployable PII removal system built on a fine-tuned Generalist and Lightweight Named Entity Recognition (GLiNER) model. We used 1515 clinical documents from a psychiatric hospital's EHR system and defined nine PII categories for removal. A modern-gliner-bi-large-v1.0 model was fine-tuned on 2849 text instances and evaluated on a test set of 376 instances using character-level precision, recall, and F1-score. We compared its performance against Microsoft Azure NER, Microsoft Presidio, and zero-shot prompting with Gemini-Pro-2.5 and Llama-3.3-70B-Instruct. The fine-tuned GLiNER model achieved superior performance, with an overall micro-average F1-score of 0.980, significantly outperforming Gemini-Pro-2.5 (F1-score: 0.845). LOGICAL completely sanitised 95% of documents, compared with 64% for the next-best solution. The model operated efficiently on a standard laptop without a dedicated GPU. However, a 2% entity-level false-negative rate underscores the need for human-in-the-loop validation across all tested systems. Fine-tuned, specialised transformer models such as GLiNER offer an accurate, computationally efficient, and secure solution for PII removal from clinical notes. This "sanitisation at the source" approach is a practical alternative to resource-intensive LLMs, enabling the creation of de-identified datasets for research and AI development while preserving data privacy, particularly in resource-constrained environments.
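A minimal sketch of how such local, GLiNER-based PII redaction can be wired up is shown below, assuming the public gliner Python package; the checkpoint path and the label list are illustrative placeholders, not the study's exact fine-tuned weights or its nine PII categories.

```python
# Sketch: local PII detection and redaction with a GLiNER model.
# Assumptions: the open-source `gliner` package is installed, and a fine-tuned
# checkpoint is available at a local path (hypothetical path below).
from gliner import GLiNER

# Hypothetical local checkpoint; the study fine-tunes modern-gliner-bi-large-v1.0.
model = GLiNER.from_pretrained("path/to/fine-tuned-gliner-checkpoint")

# Illustrative label set; the study defines nine PII categories for removal.
PII_LABELS = [
    "person name", "date of birth", "address",
    "phone number", "email address", "identification number",
]

text = "Patient Jane Doe (DOB 12/03/1985) was seen at 42 Elm Street."

# predict_entities returns spans with character offsets, a label, and a score.
entities = model.predict_entities(text, PII_LABELS, threshold=0.5)

# Replace detected spans from right to left so earlier offsets stay valid.
redacted = text
for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
    redacted = redacted[:ent["start"]] + f"[{ent['label'].upper()}]" + redacted[ent["end"]:]

print(redacted)
```

Because inference runs entirely on the local machine, no clinical text leaves the hospital network, which is the point of the "sanitisation at the source" design.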
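The character-level evaluation mentioned above can be illustrated with a short sketch, assuming gold and predicted annotations are available as (start, end) character spans per document; this mirrors the described metric but is not the authors' evaluation script.

```python
# Sketch: character-level precision, recall, and F1 over PII spans.
def char_set(spans):
    """Expand (start, end) spans into the set of covered character indices."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered

def char_level_prf(gold_spans, pred_spans):
    gold, pred = char_set(gold_spans), char_set(pred_spans)
    tp = len(gold & pred)                      # correctly flagged characters
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: one gold PII span of 8 characters, of which 4 are recovered.
print(char_level_prf(gold_spans=[(8, 16)], pred_spans=[(8, 12)]))  # (1.0, 0.5, 0.667)
```

Micro-averaging, as reported for the 0.980 F1-score, pools true positives and span lengths across all documents and categories before computing the ratios, rather than averaging per-document scores.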