Although genomic research has become increasingly widespread in recent years, few studies share their datasets because of the privacy sensitivity of genomic records. This hinders the reproduction and validation of research outcomes, which are crucial for catching errors (e.g., miscalculations) made during the research process. In this work, we introduce a differential privacy-based scheme for sharing genomic datasets to enhance the reproducibility of genome-wide association study (GWAS) outcomes. To the best of our knowledge, we are the first to propose a method for sharing genomic datasets in a privacy-preserving manner for GWAS outcome reproducibility. The scheme involves two stages. In the first stage, we generate a noisy copy of the target dataset by applying the XOR mechanism to the binarized (encoded) dataset, where the binary noise generation takes biological features into account. However, this first stage introduces significant noise, making the dataset less suitable for direct GWAS validation. Thus, in the second stage, we apply a post-processing technique that uses optimal transport to adjust the Minor Allele Frequency (MAF) values in the noisy dataset so that they align more closely with those in a publicly available dataset, and then decode the result back to the genomic space. We evaluated the proposed scheme on three real-life genomic datasets and compared it with a baseline approach and two synthesis-based solutions with regard to detecting errors in GWAS outcomes, data utility, and resistance against membership inference attacks (MIAs). Our scheme outperforms all the compared methods in detecting GWAS outcome errors, achieves better utility, and provides stronger privacy protection against MIAs. With our method, genomic researchers will be more inclined to share a differentially private yet high-quality version of their datasets.
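To make the first stage more concrete, the following is a minimal sketch of the encode, perturb, decode pipeline and of the MAF computation, under several assumptions that are not taken from the abstract. Genotypes are assumed to be minor-allele counts in {0, 1, 2}; the two-bit encoding, the function names, and the `flip_prob` parameter are hypothetical. The noise here is independent Bernoulli bit flipping, a simplified stand-in for the XOR mechanism, so the sketch carries neither its correlated noise model nor its privacy guarantee. The optimal-transport post-processing of the second stage is omitted.

```python
import numpy as np

def encode(genotypes):
    """Hypothetical 2-bit encoding: 0 -> 00, 1 -> 01, 2 -> 11."""
    g = np.asarray(genotypes)
    hi = (g == 2).astype(np.uint8)   # set only for two minor alleles
    lo = (g >= 1).astype(np.uint8)   # set for one or two minor alleles
    return np.stack([hi, lo], axis=-1)

def decode(bits):
    """Map bit pairs back to allele counts by summing the two bits."""
    return bits.sum(axis=-1)

def xor_perturb(bits, flip_prob, rng):
    """XOR the data with i.i.d. Bernoulli(flip_prob) noise.

    Assumption: independent bit flips stand in for the XOR mechanism,
    whose actual noise distribution is not reproduced here.
    """
    noise = (rng.random(bits.shape) < flip_prob).astype(np.uint8)
    return np.bitwise_xor(bits, noise)

def maf(genotypes):
    """Minor allele frequency per SNP (columns are SNPs)."""
    freq = np.asarray(genotypes).mean(axis=0) / 2.0
    return np.minimum(freq, 1.0 - freq)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.integers(0, 3, size=(100, 5))      # toy individuals x SNPs
    noisy = decode(xor_perturb(encode(data), 0.1, rng))
    print("original MAF:", maf(data))
    print("noisy MAF:   ", maf(noisy))
```

Comparing the two printed MAF vectors illustrates why a post-processing step is needed: bit-level noise shifts allele frequencies, which are the statistics that GWAS validation depends on.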