For sensitive text data to be shared among NLP researchers and practitioners, shared documents need to comply with data protection and privacy laws. There is hence a growing interest in automated approaches for text anonymization. However, measuring such methods' performance is challenging: missing a single identifying attribute can reveal an individual's identity. In this paper, we draw attention to this problem and argue that researchers and practitioners developing automated text anonymization systems should carefully assess whether their evaluation methods truly reflect the system's ability to protect individuals from being re-identified. We then propose TILD, a set of evaluation criteria that comprises an anonymization method's technical performance, the information loss resulting from its anonymization, and the human ability to de-anonymize redacted documents. These criteria may facilitate progress towards a standardized way for measuring anonymization performance.