Most NLP tasks require labeled data. Labeling is often done on crowdsourcing platforms because they scale well. However, data can only be published on public platforms if it contains no privacy-relevant information, and textual data often contains sensitive information such as person names or locations. In this work, we investigate how removing personally identifiable information (PII) and applying differential privacy (DP) rewriting can enable text containing privacy-relevant information to be used for crowdsourcing. We find that DP rewriting before crowdsourcing can preserve privacy while still yielding good label quality for certain tasks and data. PII removal yielded good label quality in all examined tasks, but it provides no privacy guarantees.