We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, which makes it difficult to properly evaluate the level of privacy protection offered by different anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR), enriched with comprehensive annotations of the personal information appearing in each document, including its semantic category, identifier type, confidential attributes, and co-reference relations. Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories) and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. In addition to the corpus and its annotation layers, we propose a set of evaluation metrics specifically tailored to measuring the performance of text anonymization, in terms of both privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus, along with its privacy-oriented annotation guidelines, evaluation scripts, and baseline models, is available at: https://github.com/NorskRegnesentral/text-anonymisation-benchmark
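To make the idea of a privacy-oriented metric concrete, the following is a minimal sketch of an entity-level recall computed over annotated mentions. It is an illustration only and is not taken from the TAB evaluation scripts: the `Mention` class, the `entity_recall` function, and the label strings `"DIRECT"` and `"QUASI"` are hypothetical names chosen here. The assumption it encodes is that an entity counts as protected only if every one of its co-referring mentions is fully masked, since a single unmasked mention can be enough to reveal an identity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One annotated text span (hypothetical schema, not the TAB file format)."""
    entity_id: str        # co-reference cluster the mention belongs to
    start: int            # character offset, inclusive
    end: int              # character offset, exclusive
    identifier_type: str  # assumed labels: "DIRECT" or "QUASI"


def entity_recall(mentions, masked_spans, identifier_type):
    """Fraction of entities of the given identifier type whose mentions
    are ALL fully covered by the masked character spans."""
    # Collect every masked character position.
    masked = set()
    for start, end in masked_spans:
        masked.update(range(start, end))

    # Group the relevant mentions by entity (co-reference cluster).
    by_entity = {}
    for m in mentions:
        if m.identifier_type == identifier_type:
            by_entity.setdefault(m.entity_id, []).append(m)

    if not by_entity:
        return 1.0  # nothing to protect, so nothing was missed

    # An entity is protected only if each of its mentions is fully masked.
    protected = sum(
        all(all(i in masked for i in range(m.start, m.end)) for m in group)
        for group in by_entity.values()
    )
    return protected / len(by_entity)


# Toy document: entity e1 is mentioned twice, but only one mention is masked.
mentions = [
    Mention("e1", 0, 10, "DIRECT"),
    Mention("e1", 50, 60, "DIRECT"),
    Mention("e2", 20, 25, "DIRECT"),
    Mention("e3", 30, 40, "QUASI"),
]
masked_spans = [(0, 10), (20, 25), (30, 40)]

print(entity_recall(mentions, masked_spans, "DIRECT"))  # 0.5
print(entity_recall(mentions, masked_spans, "QUASI"))   # 1.0
```

In this toy case a mention-level recall on direct identifiers would be 2/3, while the entity-level score is 0.5, which shows why the co-reference annotations matter when measuring privacy protection.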