The accelerating pace of genomic research has created high demand for genomic datasets in recent years, yet few studies release their datasets because of privacy concerns. This makes it difficult to reproduce and validate published findings, a step that is necessary to catch errors (e.g., miscalculations) made during the research process. In this work, to promote the reproducibility of genome-related research, we propose a novel two-stage scheme for sharing genomic datasets under differential privacy. In the first stage, the scheme generates a noisy copy of the genomic dataset by XORing the binarized (encoded) dataset with binary noise. To preserve biological features, the noise entries are generated by taking into account the inherent correlations in genomic data (obtained from publicly available datasets). In the second stage, the scheme uses optimal transport to adjust the value distribution of each column in the generated copy so that it aligns with a privacy-preserving version (protected by the Laplace mechanism) of the corresponding distribution in the original dataset. We evaluate the proposed scheme on two real-life genomic datasets from OpenSNP and compare it with two existing privacy-preserving techniques, both of which won NIST challenges. When reproducing the findings of genome-wide association studies (using $\chi^2$ tests and odds ratio tests), our scheme can detect even slight errors introduced during the research process, whereas the other methods fail to identify even significant errors. Additionally, we show experimentally that our scheme provides better data utility and stronger protection against membership inference attacks, with lower time complexity.
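As a rough illustration of the two building blocks named in the abstract, the following Python sketch XORs a bit vector with binary noise and perturbs a per-column histogram with the Laplace mechanism. It is a simplified sketch under stated assumptions, not the scheme itself: the noise bits are drawn independently (plain randomized response with flip probability $1/(1+e^{\epsilon})$) rather than from the correlation-aware model described above, the optimal transport step is omitted, the histogram sensitivity is assumed to be 1, and the function names `xor_perturb` and `laplace_histogram` are hypothetical.

```python
import math
import random


def xor_perturb(bits, epsilon, rng):
    """XOR each bit with Bernoulli noise (plain per-bit randomized response).

    Assumption: noise bits are i.i.d. with flip probability 1 / (1 + e^epsilon);
    the correlation-aware noise described in the abstract is NOT modeled here.
    """
    flip_prob = 1.0 / (1.0 + math.exp(epsilon))
    noise = [1 if rng.random() < flip_prob else 0 for _ in bits]
    return [b ^ n for b, n in zip(bits, noise)]


def laplace_histogram(column, domain, epsilon, rng, sensitivity=1.0):
    """Return Laplace-perturbed counts for each value in `domain`.

    Assumption: L1 sensitivity of 1, i.e. neighbouring datasets differ by
    adding or removing a single row.
    """
    scale = sensitivity / epsilon
    noisy = {}
    for value in domain:
        true_count = sum(1 for x in column if x == value)
        # The difference of two i.i.d. exponentials is Laplace(0, scale).
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[value] = true_count + noise
    return noisy


rng = random.Random(0)

# Hypothetical binarized record and a SNP column coded as 0/1/2.
bits = [0, 1, 1, 0, 1, 0, 0, 1]
noisy_bits = xor_perturb(bits, epsilon=1.0, rng=rng)

snp_column = [0, 1, 2, 1, 0, 0, 2, 1]
noisy_counts = laplace_histogram(snp_column, domain=(0, 1, 2), epsilon=1.0, rng=rng)
```

In the scheme described above, the noisy column distribution would then serve as the target toward which the XOR-perturbed copy is moved via optimal transport; that step is left out of this sketch.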