Many datasets contain personally identifiable information, or PII, which poses privacy risks to individuals. PII masking is commonly used to redact personal information such as names, addresses, and phone numbers from text data. Most modern PII masking pipelines involve machine learning algorithms. However, these systems may vary in performance, such that individuals from particular demographic groups bear a higher risk of having their personal information exposed. In this paper, we evaluate the performance of three off-the-shelf PII masking systems on name detection and redaction. We generate data using names and templates from the customer service domain. We find that an open-source RoBERTa-based system shows fewer disparities than the commercial models we test. However, all systems demonstrate significant differences in error rate based on demographics; in particular, the highest error rates occur for names associated with Black and Asian/Pacific Islander individuals.
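As a minimal sketch of the kind of evaluation described above, the following Python snippet fills customer-service templates with names drawn from demographic groups and measures how often each group's names leak through a masking system. The template strings, the name lists, the group labels, and the `mask_pii` callable are hypothetical placeholders, not the systems or data used in the paper.

```python
# Illustrative sketch (not the paper's code): generate utterances from
# templates, run them through a masking system, and compute per-group
# name-detection error rates.
from collections import defaultdict

# Hypothetical customer-service templates with a name slot.
templates = [
    "Hello, my name is {name} and I need help with my order.",
    "Please update the shipping address on the account for {name}.",
]

# Hypothetical first names keyed by an associated demographic group label.
names_by_group = {
    "group_a": ["Alice", "Brian"],
    "group_b": ["Chen", "Keisha"],
}

def mask_pii(text):
    """Placeholder for an off-the-shelf masking system's API call.

    A real system would return `text` with detected PII replaced,
    e.g. by "[NAME]". This stand-in detects nothing.
    """
    return text

def name_detection_error_rate(masker):
    """Fraction of generated utterances in which the inserted name survives masking."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for group, names in names_by_group.items():
        for name in names:
            for template in templates:
                utterance = template.format(name=name)
                masked = masker(utterance)
                totals[group] += 1
                if name in masked:  # the name leaked through unmasked
                    errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

if __name__ == "__main__":
    print(name_detection_error_rate(mask_pii))
```

Comparing the resulting per-group error rates is one simple way to surface the kind of demographic disparities the abstract reports.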