Missing data and disclosure control are ubiquitous and challenging issues in many contexts. At statistical agencies in particular, the respondent-level data collected from surveys and censuses can suffer from high rates of missingness. Furthermore, agencies are obliged to protect respondents' privacy when publishing the collected data for public use. The NPBayesImputeCat R package, introduced in this paper, provides routines for multivariate categorical data, with or without structural zeros, to i) create multiple imputations for missing data and ii) create synthetic data for statistical disclosure control. We describe the Dirichlet process mixture of products of multinomial distributions model used in the package and illustrate various uses of the package with data samples from the American Community Survey (ACS). We also compare the missing data imputation results with those of the mice R package and the synthetic data generation results with those of the synthpop R package.
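For orientation, the Dirichlet process mixture of products of multinomial distributions model named above is commonly written in the truncated stick-breaking form sketched below. The notation is an assumption introduced here for illustration and does not come from the abstract: n records, p categorical variables, variable j with d_j levels, K latent classes as the truncation level, and hyperparameters a_alpha, b_alpha, and a^{(j)}. The paper's own notation and prior choices may differ.

```latex
\begin{align*}
X_{ij} \mid z_i, \theta &\sim \mathrm{Categorical}\bigl(\theta^{(j)}_{z_i 1}, \dots, \theta^{(j)}_{z_i d_j}\bigr),
  && i = 1, \dots, n, \quad j = 1, \dots, p, \\
z_i \mid \pi &\sim \mathrm{Categorical}(\pi_1, \dots, \pi_K), \\
\pi_k &= V_k \prod_{l < k} (1 - V_l),
  && k = 1, \dots, K, \\
V_k \mid \alpha &\sim \mathrm{Beta}(1, \alpha),
  && k = 1, \dots, K - 1, \qquad V_K = 1, \\
\alpha &\sim \mathrm{Gamma}(a_\alpha, b_\alpha), \\
\theta^{(j)}_{k} = \bigl(\theta^{(j)}_{k1}, \dots, \theta^{(j)}_{k d_j}\bigr)
  &\sim \mathrm{Dirichlet}\bigl(a^{(j)}_{1}, \dots, a^{(j)}_{d_j}\bigr).
\end{align*}
```

Under this formulation the variables are independent within a latent class, so dependence among them is captured by mixing over classes. Setting V_K = 1 makes the K class probabilities sum to one.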