Fidelity and Privacy of Synthetic Medical Data

The digitization of medical records brought big data to clinical science, and with it the possibility of sharing data to multiply insights beyond what investigators could abstract from paper records. The need to share individual-level medical data to accelerate innovation in precision medicine continues to grow, and has never been more urgent than now, as scientists grapple with the COVID-19 pandemic. However, enthusiasm for big data has been tempered by an entirely appropriate concern for patient autonomy and privacy. Because private or confidential information about an individual can be extracted from shared records, sharing is difficult in practice: significant infrastructure and data governance must be established before any data can be released. Although HIPAA approved de-identification as a mechanism for data sharing, linkage attacks, which re-identify individuals by joining de-identified records with other data sources, were identified as a major vulnerability. A variety of mechanisms have been established to avoid leaking private information, such as suppressing or abstracting fields, strictly limiting the amount of information that can be shared, or employing mathematical techniques such as differential privacy. Another approach, which we focus on here, is to create synthetic data that mimics the underlying data. For synthetic data to be a useful mechanism in support of medical innovation and a proxy for real-world evidence, one must demonstrate two properties of the synthetic dataset: (1) the results of any analysis on the real data must be matched by the same analysis on the synthetic data (statistical fidelity), and (2) the synthetic data must preserve privacy, with minimal risk of re-identification (privacy guarantee). In this paper we propose a framework for quantifying the statistical fidelity and privacy preservation properties of synthetic datasets, and we demonstrate these metrics on synthetic data generated by Syntegra technology.
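
To make the two properties concrete, the following is a minimal sketch of one simple check for each. It is an illustration under stated assumptions, not the framework's metrics, which this abstract does not specify. The function names, the toy records, and the choice of total variation distance (for fidelity) and exact-match rate (for privacy) are all assumptions introduced here.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Distance between the empirical distributions of one categorical field.

    Returns 0.0 when the two distributions are identical and 1.0 when they
    share no categories. This is one assumed fidelity measure among many.
    """
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = len(real_values)
    n_synth = len(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


def exact_match_rate(real_records, synthetic_records):
    """Fraction of synthetic records that duplicate some real record exactly.

    A crude, assumed privacy signal: a high rate suggests the generator is
    copying real individuals rather than sampling new ones.
    """
    real_set = set(real_records)
    matches = sum(1 for record in synthetic_records if record in real_set)
    return matches / len(synthetic_records)


# Hypothetical toy records: (sex, age band, diagnosis).
real = [
    ("F", "40-49", "diabetes"),
    ("M", "50-59", "asthma"),
    ("F", "60-69", "diabetes"),
    ("M", "40-49", "none"),
]
synthetic = [
    ("F", "40-49", "asthma"),
    ("M", "50-59", "asthma"),
    ("F", "60-69", "diabetes"),
    ("M", "40-49", "diabetes"),
]

# Fidelity of the diagnosis field only (index 2 of each record).
diagnosis_tvd = total_variation_distance(
    [r[2] for r in real], [s[2] for s in synthetic]
)
# Privacy signal over whole records.
copy_rate = exact_match_rate(real, synthetic)

print(f"diagnosis TVD: {diagnosis_tvd:.2f}")
print(f"exact-match rate: {copy_rate:.2f}")
```

The two numbers pull in opposite directions: a generator that simply copies the real records achieves a distance of zero on every field but an exact-match rate of one, which is why fidelity and privacy have to be measured together. An exact-match rate of zero does not by itself rule out re-identification through near matches or linkage.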
