The increased use of text data in social science research has benefited from easy-to-access data (e.g., Twitter). This trend has come at the expense of research that requires sensitive but hard-to-share data (e.g., interview data, police reports, electronic health records). We offer a way out of this stalemate with the open-source text anonymisation software _Textwash_. This paper presents an empirical evaluation of the tool using the TILD criteria: a technical evaluation (how accurate is the tool?), an information loss evaluation (how much information is lost in the anonymisation process?) and a de-anonymisation test (can humans identify individuals from anonymised text data?). The findings suggest that Textwash performs similarly to state-of-the-art entity recognition models and introduces a negligible information loss of 0.84%. For the de-anonymisation test, we asked human participants to identify individuals by name from a dataset of crowdsourced person descriptions of very famous, semi-famous and non-existent individuals. The de-anonymisation rate ranged from 1.01% to 2.01% for the realistic use cases of the tool. We replicated the findings in a second study and conclude that Textwash succeeds in removing potentially sensitive information, rendering detailed person descriptions practically anonymous.