Safe and reliable disclosure of information from confidential data is a challenging statistical problem. A common approach is to generate synthetic data and disclose them in place of the original data. Efficient approaches must handle the trade-off between the reliability and the confidentiality of the released data. Ultimately, the aim is to reproduce, as accurately as possible, statistical analyses of the original data using the synthetic data. Bayesian networks are a model-based approach that can be used to parsimoniously estimate the underlying distribution of the original data and to generate synthetic datasets. These datasets should not only approximate the results of analyses of the original data but also robustly quantify the uncertainty involved in the approximation. This paper proposes a fully Bayesian approach to generating and analyzing synthetic data based on the posterior predictive distribution of statistics of the synthetic data, allowing for efficient uncertainty quantification. The methodology exploits probabilistic properties of the model to devise a computationally efficient Monte Carlo algorithm for obtaining the target predictive distributions. Model parsimony is handled by proposing a general class of penalizing priors for Bayesian network models. Finally, the efficiency and applicability of the proposed methodology are empirically investigated through simulated and real examples.
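To make the idea of a posterior predictive distribution of a synthetic-data statistic concrete, the following is a minimal sketch under strong simplifying assumptions, not the algorithm proposed in the paper. It assumes a fixed two-node network A -> B over binary variables, symmetric Dirichlet priors with a hypothetical concentration `prior`, and the proportion of B = 1 as the statistic of interest. The penalizing priors, structure learning, and efficiency devices mentioned in the abstract are not represented, and all function names are illustrative.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def posterior_predictive_statistic(data, n_synth, n_draws, prior=1.0, seed=0):
    """Monte Carlo draws of the proportion of B == 1 in synthetic datasets.

    data: list of (a, b) pairs with a, b in {0, 1}, for the assumed network A -> B.
    """
    rng = random.Random(seed)
    # Sufficient statistics: counts of A, and counts of B within each value of A.
    count_a = [0, 0]
    count_b_given_a = [[0, 0], [0, 0]]
    for a, b in data:
        count_a[a] += 1
        count_b_given_a[a][b] += 1

    stats = []
    for _ in range(n_draws):
        # Step 1: draw the network parameters from their conjugate Dirichlet posteriors.
        p_a = sample_dirichlet([prior + c for c in count_a], rng)
        p_b = [
            sample_dirichlet([prior + c for c in count_b_given_a[a]], rng)
            for a in (0, 1)
        ]
        # Step 2: generate one synthetic dataset by ancestral sampling (A, then B given A).
        ones = 0
        for _ in range(n_synth):
            a = 1 if rng.random() < p_a[1] else 0
            b = 1 if rng.random() < p_b[a][1] else 0
            ones += b
        # Step 3: record the statistic computed on this synthetic dataset.
        stats.append(ones / n_synth)
    return stats


def summarize(stats, level=0.95):
    """Return the mean and an equal-tailed empirical interval of the draws."""
    ordered = sorted(stats)
    lo = ordered[int((1 - level) / 2 * (len(ordered) - 1))]
    hi = ordered[int((1 + level) / 2 * (len(ordered) - 1))]
    return sum(ordered) / len(ordered), (lo, hi)


if __name__ == "__main__":
    # Toy "confidential" data, invented purely for illustration.
    toy = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 10 + [(1, 1)] * 40
    draws = posterior_predictive_statistic(toy, n_synth=200, n_draws=500)
    print(summarize(draws))
```

The spread of the returned draws reflects both parameter uncertainty (step 1) and the sampling variability of a finite synthetic dataset (step 2), which is the kind of uncertainty an analyst of the released data would need to account for.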