Re-identification of Individuals in Genomic Datasets Using Public Face Images

DNA sequencing is becoming increasingly commonplace in both medical and direct-to-consumer settings. To promote discovery, collected genomic data is often de-identified and shared, either in public repositories, such as OpenSNP, or with researchers through access-controlled repositories. However, recent studies have suggested that genomic data can be effectively matched to high-resolution three-dimensional face images. This raises the concern that increasingly ubiquitous public face images could be linked to shared genomic data, thereby re-identifying the individuals in that data. While these investigations illustrate that such an attack is possible, they assume that the attacker has access to extremely well-curated data. Because this is unlikely to be the case in practice, the practicality of the attack is open to question. We therefore systematically study this re-identification risk from two perspectives: first, we investigate how successful such linkage attacks can be when real face images are used, and second, we consider how individuals can be given better control over the associated re-identification risk. We observe that the true risk of re-identification is likely substantially smaller for most individuals than prior literature suggests. In addition, we demonstrate that adding a small amount of carefully crafted noise to images enables a controlled trade-off between re-identification success and the quality of shared images, and that the risk is typically lowered significantly even when the noise is imperceptible to humans.
