Measuring Utility and Privacy of Synthetic Genomic Data

Genomic data provides researchers with an invaluable source of information to advance progress in biomedical research, personalized medicine, and drug development. At the same time, however, this data is extremely sensitive, which makes data sharing, and consequently availability, problematic if not outright impossible. As a result, organizations have begun to experiment with sharing synthetic data, which should mirror the real data's salient characteristics, without exposing it. In this paper, we provide the first evaluation of the utility and the privacy protection of five state-of-the-art models for generating synthetic genomic data. First, we assess the performance of the synthetic data on a number of common tasks, such as allele and population statistics as well as linkage disequilibrium and principal component analysis. Then, we study the susceptibility of the data to membership inference attacks, i.e., inferring whether a target record was part of the data used to train the model producing the synthetic dataset. Overall, there is no single approach for generating synthetic genomic data that performs well across the board. We show how the size and the nature of the training dataset matter, especially in the case of generative models. While some combinations of datasets and models produce synthetic data with distributions close to the real data, there often are target data points that are vulnerable to membership inference. Our measurement framework can be used by practitioners to assess the risks of deploying synthetic genomic data in the wild, and will serve as a benchmark tool for researchers and practitioners in the future.
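As a rough illustration of the kind of allele-statistics comparison mentioned above, the following minimal sketch computes per-SNP allele frequencies for a real and a synthetic dataset and reports their mean absolute difference. The function names, the 0/1 haplotype encoding, and the toy matrices are assumptions made for illustration only; they are not taken from the paper's framework or its data.

```python
from statistics import fmean


def allele_frequencies(haplotypes):
    """Per-SNP frequency of the allele coded as 1.

    Assumes `haplotypes` is a list of equal-length rows of 0/1 values,
    one row per haplotype and one column per SNP (hypothetical encoding).
    """
    n = len(haplotypes)
    return [sum(column) / n for column in zip(*haplotypes)]


def allele_frequency_gap(real, synthetic):
    """Mean absolute difference between real and synthetic allele frequencies."""
    real_freq = allele_frequencies(real)
    syn_freq = allele_frequencies(synthetic)
    return fmean([abs(r - s) for r, s in zip(real_freq, syn_freq)])


# Toy data for illustration only: 4 haplotypes, 4 SNPs each.
real = [
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
synthetic = [
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

gap = allele_frequency_gap(real, synthetic)
print(f"mean absolute allele-frequency gap: {gap:.4f}")
```

A gap near zero suggests the synthetic data preserves marginal allele frequencies, but it says nothing about higher-order structure such as linkage disequilibrium, nor about privacy: a dataset can match these statistics closely and still contain records vulnerable to membership inference.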
