In this paper, we consider the problem of answering count queries for genomic data subject to perfect privacy constraints. Count queries are often used in applications that collect aggregate (population-wide) information from biomedical databases (DBs) for analysis, such as genome-wide association studies. Our goal is to design mechanisms for answering count queries of the following form: How many users in the database have a specific set of genotypes at certain locations in their genome? At the same time, we aim to achieve perfect privacy (zero information leakage) of the sensitive genotypes at a pre-specified set of secret locations. The sensitive genotypes could indicate rare diseases and/or other health traits that one may want to keep private. We present two local count-query mechanisms for the above problem that achieve perfect privacy for sensitive genotypes while minimizing the expected absolute error (or per-user error probability) of the query answer. We also derive a lower bound on the per-user probability of error for an arbitrary query-answering mechanism that satisfies perfect privacy. We show that our mechanisms achieve error close to the lower bound and match the lower bound in some special cases. We numerically show that the performance of each mechanism depends on the data prior distribution, the intersection between the queried and sensitive data, and the strength of the correlation in the genomic data sequence.
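To make the two quantities in the abstract concrete, the following minimal sketch computes the information leakage about a secret genotype and the per-user error probability for a toy local mechanism. It is an illustration under stated assumptions, not the mechanisms proposed in the paper: the single secret bit `S`, the single queried bit `X`, the toy prior, the three example mechanisms, and the helper name `leakage_and_error` are all hypothetical, and leakage is measured here as the mutual information I(S;Y) between the secret and the user's reported answer `Y`.

```python
from math import log2

def leakage_and_error(prior, mechanism):
    """Hypothetical helper.
    prior[(s, x)]        = P(S=s, X=x), secret bit s and queried bit x.
    mechanism[(s, x)][y] = P(Y=y | S=s, X=x), the user's reported answer.
    Returns (I(S;Y) in bits, per-user error probability P(Y != X))."""
    joint_sy = {}
    error = 0.0
    for (s, x), p in prior.items():
        for y, q in mechanism[(s, x)].items():
            joint_sy[(s, y)] = joint_sy.get((s, y), 0.0) + p * q
            if y != x:
                error += p * q
    # Marginals of S and Y, needed for the mutual information.
    p_s, p_y = {}, {}
    for (s, y), p in joint_sy.items():
        p_s[s] = p_s.get(s, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    mi = sum(p * log2(p / (p_s[s] * p_y[y]))
             for (s, y), p in joint_sy.items() if p > 0)
    return mi, error

# Assumed toy prior: the secret and queried genotypes are positively correlated.
prior = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Truthful reporting: Y = X. Zero error, but Y leaks information about S.
truthful = {(s, x): {x: 1.0} for (s, x) in prior}

# Constant reporting: Y = 0 always. Zero leakage, but large error.
constant = {(s, x): {0: 1.0} for (s, x) in prior}

# A randomized report chosen so that P(Y=1 | S=s) = 0.5 for both values of s,
# which makes Y independent of S in this toy example.
randomized = {
    (0, 0): {0: 0.625, 1: 0.375},
    (0, 1): {1: 1.0},
    (1, 0): {0: 1.0},
    (1, 1): {0: 0.375, 1: 0.625},
}

for name, mech in [("truthful", truthful), ("constant", constant),
                   ("randomized", randomized)]:
    mi, err = leakage_and_error(prior, mech)
    print(f"{name}: leakage = {mi:.4f} bits, error = {err:.4f}")
```

In this toy setting, the truthful report has positive leakage because `X` is correlated with `S`, while the constant and randomized reports both have zero leakage; the randomized one does so at a lower per-user error, which is the kind of trade-off the abstract describes.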