Motivated by the growing availability of personal genomics services, we study an information-theoretic privacy problem that arises when sharing genomic data: a user wants to share his or her genome sequence while keeping the genotypes at certain positions hidden, which could otherwise reveal critical health-related information. A straightforward solution of erasing (masking) the chosen genotypes does not ensure privacy, because the correlation between nearby positions can leak the masked genotypes. We introduce an erasure-based privacy mechanism with perfect information-theoretic privacy, whereby the released sequence is statistically independent of the sensitive genotypes. Our mechanism can be interpreted as a locally-optimal greedy algorithm for a given processing order of sequence positions, where utility is measured by the number of positions released without erasure. We show that finding an optimal order is NP-hard in general and provide an upper bound on the optimal utility. For sequences from hidden Markov models, a standard modeling approach in genetics, we propose an efficient algorithmic implementation of our mechanism with complexity polynomial in sequence length. Moreover, we illustrate the robustness of the mechanism by bounding the privacy leakage from erroneous prior distributions. Our work is a step towards more rigorous control of privacy in genomic data sharing.
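The claim that naive masking leaks through correlation can be checked on a toy case. The sketch below is illustrative only and is not the mechanism proposed in this work. It assumes a hypothetical two-position binary sequence with a uniform prior and a transition matrix that keeps the neighboring genotype equal to the sensitive one with probability 0.9. All names and numbers are assumptions chosen for illustration. It computes the mutual information between the masked genotype and the released neighbor, then repeats the calculation with the neighbor erased as well.

```python
import math


def mutual_information(joint):
    """Mutual information in bits; joint[i][j] = P(X = i, Y = j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi


# Hypothetical toy model (assumption): sensitive genotype X1, neighbor X2.
prior = [0.5, 0.5]                      # P(X1)
transition = [[0.9, 0.1], [0.1, 0.9]]   # P(X2 | X1)

# Naive masking: X1 is erased, X2 is released unchanged.
joint_naive = [[prior[a] * transition[a][b] for b in range(2)]
               for a in range(2)]
leak_naive = mutual_information(joint_naive)

# Erasing X2 as well: the release is a constant symbol, independent of X1.
joint_erased = [[prior[a]] for a in range(2)]
leak_erased = mutual_information(joint_erased)

print(f"leakage with naive masking: {leak_naive:.3f} bits")
print(f"leakage with neighbor erased: {leak_erased:.3f} bits")
```

In this toy setting, releasing the neighbor reveals roughly half a bit about the masked genotype, while erasing it gives zero leakage at the cost of utility. Deciding which positions must be erased to reach independence while releasing as many as possible is the trade-off the abstract describes.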