When an individual's DNA is sequenced, sensitive medical information becomes available to the sequencing laboratory. A recently proposed way to hide an individual's genetic information is to mix in DNA samples of other individuals. We assume these samples are known to the individual but unknown to the sequencing laboratory. Thus, these DNA samples act as "noise" to the sequencing laboratory, but still allow the individual to recover their own genetic information afterward. Motivated by this idea, we study the problem of hiding a binary random variable X (a genetic marker) with the additive noise provided by mixing DNA samples, using mutual information as a privacy metric. This is equivalent to finding, among a set of feasible discrete distributions, a worst-case noise distribution for recovering X from the noisy observation. We derive upper and lower bounds on the solution of this problem, which are empirically shown to be very close. The lower bound is obtained through a convex relaxation of the original discrete optimization problem, and yields a closed-form expression. The upper bound is computed via a greedy algorithm for selecting the mixing proportions.
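
To make the privacy metric concrete, the following minimal Python sketch evaluates the mutual information I(X; X+Z) between a binary marker X and its noisy observation for a given discrete noise distribution. It is an illustration only: the function names `entropy` and `mutual_information`, the assumption that X is Bernoulli with parameter `p_x`, and the assumption that the noise Z is independent of X and supported on {0, ..., n-1} are choices made here, not details taken from the abstract. It implements neither the convex relaxation nor the greedy algorithm, and the noise distributions it accepts are arbitrary rather than restricted to the feasible set induced by mixing DNA samples.

```python
import math


def entropy(pmf):
    """Shannon entropy in bits of a pmf given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def mutual_information(p_x, noise_pmf):
    """I(X; X+Z) in bits.

    Assumptions (illustrative, not from the source):
    X ~ Bernoulli(p_x), Z is independent of X, and
    noise_pmf[z] = P(Z = z) for z = 0, ..., len(noise_pmf) - 1.
    """
    n = len(noise_pmf)
    # Distribution of Y = X + Z, supported on {0, ..., n}.
    y_pmf = [0.0] * (n + 1)
    for z, q in enumerate(noise_pmf):
        y_pmf[z] += (1 - p_x) * q      # contribution from X = 0
        y_pmf[z + 1] += p_x * q        # contribution from X = 1
    # Given X, Y is a shifted copy of Z, so H(Y | X) = H(Z).
    return entropy(y_pmf) - entropy(noise_pmf)


if __name__ == "__main__":
    # Compare no noise with uniform noise on two and on four values.
    for pmf in ([1.0], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]):
        print(pmf, mutual_information(0.5, pmf))
```

Under these assumptions, a noise distribution that spreads its mass over more values leaks less about X: with `p_x = 0.5`, the leakage falls from 1 bit with no noise to 0.5 bits with uniform noise on {0, 1}. Searching a constrained family of such distributions for the one that minimizes this quantity is the worst-case noise problem described above.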