Man vs the machine: The Struggle for Effective Text Anonymisation in the Age of Large Language Models

The collection and use of personal data are becoming more common in today's data-driven culture. While this brings many advantages, including better decision-making and service delivery, it also raises significant ethical issues around confidentiality and privacy. To alleviate privacy concerns, text anonymisation tries to prune and/or mask identifiable information in a text while keeping the remaining content intact. Text anonymisation is especially important in fields such as healthcare, law, and research, where sensitive and personal information is collected, processed, and exchanged under high legal and ethical standards. Although text anonymisation is widely adopted in practice, it continues to face considerable challenges. The most significant challenge is striking a balance between removing information to protect individuals' privacy and maintaining the text's usability for future purposes. The question is whether these anonymisation methods sufficiently reduce the risk of re-identification, in which an individual can be identified based on the remaining information in the text. In this work, we challenge the effectiveness of these methods and how we perceive identifiers. We assess the efficacy of these methods against the elephant in the room, the use of AI over big data. While most research focuses on identifying and removing personal information, there is limited discussion of whether the remaining information is sufficient to deanonymise individuals and, more precisely, who can do it. To this end, we conduct an experiment using GPT over anonymised texts of famous people to determine whether such trained networks can deanonymise them. This experiment allows us to revise these methods and introduce a novel methodology that employs Large Language Models to improve the anonymity of texts.
