Motivation: Researchers need a rich trove of genomic datasets to gain a better understanding of the human genome and to identify associations between phenotypes and specific parts of DNA. However, sharing genomic datasets that include sensitive genetic or medical information about individuals can have serious privacy consequences if the data fall into the wrong hands. Restricting access to genomic datasets is one solution, but it greatly reduces their usefulness for research. To allow genomic datasets to be shared while addressing these privacy concerns, several studies have proposed privacy-preserving mechanisms for data sharing. Differential privacy (DP) is one such mechanism: it provides a rigorous mathematical foundation for privacy guarantees when sharing aggregate statistical information about a dataset. However, the original privacy guarantees of DP-based solutions have been shown to degrade when the dataset contains dependent tuples, which is a common scenario for genomic datasets because they often include family members.

Results: In this work, we introduce a near-optimal mechanism that mitigates the vulnerability to inference attacks of differentially private query results from genomic datasets that include dependent tuples. We propose a utility-maximizing and privacy-preserving approach for sharing statistics that selectively hides SNPs of family members as they participate in a genomic dataset. By evaluating our mechanism on a real-world genomic dataset, we empirically demonstrate that it achieves up to 40% better privacy than state-of-the-art DP-based solutions while near-optimally minimizing the utility loss.
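For readers unfamiliar with how DP releases an aggregate statistic, the following is a minimal sketch of the standard Laplace mechanism applied to a counting query over a single SNP. It is not the mechanism proposed in this work; the function names (`laplace_noise`, `dp_count`) and the toy genotype column are hypothetical and chosen only for illustration. The sensitivity of 1 used here assumes that records are independent, which is exactly the assumption that breaks down when family members appear in the same dataset.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent Exp(1) variables is Laplace(0, 1).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_count(genotypes, predicate, epsilon, rng):
    """Release a noisy count using the standard Laplace mechanism.

    Assumption: records are independent, so adding or removing one
    individual changes the count by at most 1 (sensitivity = 1).
    With dependent tuples (e.g. relatives), this noise scale may
    understate the true privacy leakage.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for g in genotypes if predicate(g))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data: minor-allele count (0, 1, or 2) at one SNP
# for ten individuals.
snp_column = [0, 1, 2, 1, 0, 0, 1, 2, 0, 1]

rng = random.Random(0)  # seeded only to make the sketch reproducible
noisy_carriers = dp_count(snp_column, lambda g: g >= 1, epsilon=1.0, rng=rng)
print(f"noisy number of minor-allele carriers: {noisy_carriers:.2f}")
```

A smaller `epsilon` adds more noise and gives a stronger nominal guarantee. The point made in the abstract is that this nominal guarantee can be weaker in practice when the genotypes of relatives are correlated, which motivates mechanisms that account for such dependence.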