Local Obfuscation by GLINER for Impartial Context Aware Lineage: Development and evaluation of PII Removal system

Removing Personally Identifiable Information (PII) from clinical notes in Electronic Health Records (EHRs) is essential for research and AI development. While Large Language Models (LLMs) are powerful, their high computational costs and the data privacy risks of API-based services limit their use, especially in low-resource settings. To address this, we developed LOGICAL (Local Obfuscation by GLINER for Impartial Context-Aware Lineage), an efficient, locally deployable PII removal system built on a fine-tuned Generalist and Lightweight Named Entity Recognition (GLiNER) model. We used 1515 clinical documents from a psychiatric hospital's EHR system. We defined nine PII categories for removal. A modern-gliner-bi-large-v1.0 model was fine-tuned on 2849 text instances and evaluated on a test set of 376 instances using character-level precision, recall, and F1-score. We compared its performance against Microsoft Azure NER, Microsoft Presidio, and zero-shot prompting with Gemini-Pro-2.5 and Llama-3.3-70B-Instruct. The fine-tuned GLiNER model achieved superior performance, with an overall micro-average F1-score of 0.980, significantly outperforming Gemini-Pro-2.5 (F1-score: 0.845). LOGICAL correctly sanitised 95% of documents completely, compared to 64% for the next-best solution. The model operated efficiently on a standard laptop without a dedicated GPU. However, a 2% entity-level false negative rate underscores the need for human-in-the-loop validation across all tested systems. Fine-tuned, specialised transformer models like GLiNER offer an accurate, computationally efficient, and secure solution for PII removal from clinical notes. This "sanitisation at the source" approach is a practical alternative to resource-intensive LLMs, enabling the creation of de-identified datasets for research and AI development while preserving data privacy, particularly in resource-constrained environments.
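The character-level precision, recall, and micro-averaged F1-score mentioned above can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes PII annotations are given as `(start, end)` character offsets with an exclusive end, it ignores the PII category of each span, and it treats a document as "completely sanitised" when no gold PII character is missed. The abstract does not give the exact definition, so that last rule is an assumption, and the function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Assumption: a span is (start, end) in character offsets, end exclusive.
Span = Tuple[int, int]


def span_chars(spans: Iterable[Span]) -> Set[int]:
    """Expand spans into the set of character positions they cover."""
    covered: Set[int] = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def char_level_micro_prf(
    gold_docs: List[List[Span]], pred_docs: List[List[Span]]
) -> Tuple[float, float, float]:
    """Micro-averaged character-level precision, recall, and F1.

    Counts are pooled over all documents before the ratios are taken,
    which is what makes the average "micro". Labels are ignored here.
    """
    if len(gold_docs) != len(pred_docs):
        raise ValueError("gold_docs and pred_docs must have the same length")
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_chars, pred_chars = span_chars(gold), span_chars(pred)
        tp += len(gold_chars & pred_chars)   # PII characters correctly flagged
        fp += len(pred_chars - gold_chars)   # non-PII characters flagged
        fn += len(gold_chars - pred_chars)   # PII characters missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


def fully_sanitised_rate(
    gold_docs: List[List[Span]], pred_docs: List[List[Span]]
) -> float:
    """Fraction of documents with no missed gold PII character.

    Assumption: over-redaction (false positives) does not count against
    a document here; only leaked PII characters do.
    """
    if len(gold_docs) != len(pred_docs):
        raise ValueError("gold_docs and pred_docs must have the same length")
    if not gold_docs:
        return 0.0
    clean = sum(
        1
        for gold, pred in zip(gold_docs, pred_docs)
        if not (span_chars(gold) - span_chars(pred))
    )
    return clean / len(gold_docs)
```

Pooling character counts before dividing means long documents and long entities weigh more than short ones. That is one reason a high micro-average F1-score can coexist with a lower share of fully sanitised documents: a single missed entity is enough to leave a document unsanitised.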

