Much of the micro data used in epidemiological studies contains sensitive measurements on real individuals. Because of privacy concerns, such micro data cannot be published, which makes any published statistical analyses of them nearly impossible to reproduce. To promote the dissemination of key datasets for analysis without jeopardizing the privacy of individuals, we introduce a cohesive Bayesian framework for generating fully synthetic, high-dimensional micro datasets of mixed categorical, binary, count, and continuous variables. The framework centers on a joint Bayesian model that is simultaneously compatible with all of these data types, enabling the creation of mixed synthetic datasets through posterior predictive sampling. Furthermore, a focal point of epidemiological data analysis is the study of conditional relationships between various exposures and key outcome variables through regression analysis. We design a modified data synthesis strategy that targets and preserves these conditional relationships, including both nonlinearities and interactions. The proposed techniques are deployed to create a synthetic version of a confidential dataset containing dozens of health, cognitive, and social measurements on nearly 20,000 North Carolina children.
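To make the idea of posterior predictive sampling for mixed data types concrete, the following is a minimal sketch and not the joint model proposed in the paper. It assumes, purely for illustration, that a binary, a count, and a continuous variable are modeled independently with simple conjugate or noninformative priors (Beta-Bernoulli, Gamma-Poisson, and a normal model with prior proportional to 1/sigma^2). The function name `synthesize`, the prior settings, and the independence assumption are all hypothetical; the framework described above instead uses a joint model that captures dependence among variables.

```python
import numpy as np

def synthesize(binary, counts, continuous, n_synth, seed=0):
    """Draw one synthetic dataset via posterior predictive sampling.

    Illustrative assumption: the three variables are modeled independently,
    which is NOT the joint model described in the abstract.
    """
    rng = np.random.default_rng(seed)
    n = len(binary)

    # Binary variable: Beta(1, 1) prior on the success probability.
    # Posterior is Beta(1 + successes, 1 + failures).
    p = rng.beta(1 + binary.sum(), 1 + n - binary.sum())
    syn_binary = rng.binomial(1, p, size=n_synth)

    # Count variable: Gamma(a0, rate b0) prior on the Poisson mean.
    # Posterior is Gamma(a0 + sum of counts, rate b0 + n).
    a0, b0 = 1.0, 1.0  # hypothetical prior settings
    lam = rng.gamma(shape=a0 + counts.sum(), scale=1.0 / (b0 + n))
    syn_counts = rng.poisson(lam, size=n_synth)

    # Continuous variable: normal likelihood, prior p(mu, sigma2) ∝ 1 / sigma2.
    # sigma2 | y ~ (n - 1) s^2 / chi2_{n-1};  mu | sigma2, y ~ N(ybar, sigma2 / n).
    s2 = continuous.var(ddof=1)
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)
    mu = rng.normal(continuous.mean(), np.sqrt(sigma2 / n))
    syn_cont = rng.normal(mu, np.sqrt(sigma2), size=n_synth)

    return syn_binary, syn_counts, syn_cont
```

Each call first draws parameters from their posterior given the confidential data and then simulates new records from those parameters, so the released values are model draws and not copies of real individuals. Because this sketch treats the variables independently, it would not preserve the regression relationships between exposures and outcomes that the abstract emphasizes.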