Individual-level data (microdata) that characterizes a population is essential for studying many real-world problems. However, acquiring such data is difficult because of cost and privacy constraints, and access is often limited to aggregated data (macrodata) sources. In this study, we examine synthetic data generation as a tool for extrapolating difficult-to-obtain high-resolution data by combining information from multiple easier-to-obtain lower-resolution data sources. In particular, we introduce a framework that combines univariate and multivariate frequency tables from a given target geographical location with frequency tables from other auxiliary locations to generate synthetic microdata for individuals in the target location. Our method estimates a dependency graph and conditional probabilities from the target location and uses a Gaussian copula to leverage the information available from the auxiliary locations. We perform extensive testing on two real-world datasets and demonstrate that our approach outperforms prior approaches in preserving the overall dependency structure of the data while also satisfying the constraints defined on the different variables.
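As a rough illustration of the Gaussian copula ingredient only, the following Python sketch draws synthetic categorical records whose univariate marginals match given frequency tables while a latent correlation matrix induces dependence between variables. It is a minimal sketch under stated assumptions, not the framework described above: the function name `sample_gaussian_copula`, the example tables, and the correlation value are all hypothetical, the correlation matrix is simply assumed to have been estimated elsewhere (for instance from auxiliary locations), and the dependency graph and conditional-probability estimation are not shown.

```python
import math
import numpy as np


def sample_gaussian_copula(marginals, corr, n, seed=0):
    """Draw n synthetic categorical records (hypothetical helper).

    marginals: list of 1-D count arrays, one univariate frequency table per variable.
    corr: latent correlation matrix, assumed here to be estimated elsewhere.
    """
    rng = np.random.default_rng(seed)
    # Correlated latent normals via the Cholesky factor of the correlation matrix.
    chol = np.linalg.cholesky(np.asarray(corr, dtype=float))
    z = rng.standard_normal((n, len(marginals))) @ chol.T
    # The standard normal CDF maps each latent column to Uniform(0, 1).
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    records = np.empty((n, len(marginals)), dtype=int)
    for j, counts in enumerate(marginals):
        cdf = np.cumsum(counts) / np.sum(counts)
        # Inverse-CDF step: first category whose cumulative share reaches u.
        idx = np.searchsorted(cdf, u[:, j], side="left")
        # Guard against floating-point round-off in the last cumulative value.
        records[:, j] = np.minimum(idx, len(counts) - 1)
    return records


# Illustrative (made-up) frequency tables: 3 age bands and 2 employment states.
age_counts = np.array([30, 50, 20])
employment_counts = np.array([60, 40])
latent_corr = [[1.0, 0.6], [0.6, 1.0]]  # assumed value, for illustration only

synthetic = sample_gaussian_copula([age_counts, employment_counts], latent_corr, n=20000)
```

Each row of `synthetic` is one synthetic individual, encoded as category indices. Because every column is produced by an inverse-CDF transform of a uniform variable, the univariate marginals are reproduced up to sampling noise regardless of the correlation matrix chosen.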