Synthetic data generation is useful in many settings, from application testing to benchmarking to privacy preservation. An important part of this problem is generating the links between relations subject to cardinality constraints (CCs) and integrity constraints (ICs). We are given instances of two relations, one of which has a foreign key dependence on the other and is missing its foreign key ($FK$) values, together with two types of constraints: (1) CCs that apply to the join view and (2) ICs that apply to the table with the missing $FK$ values. Our goal is to impute the missing $FK$ values so that the constraints are satisfied. We provide a novel framework for the problem based on declarative CCs and ICs. We further show that the problem is NP-hard and propose a novel two-phase solution that guarantees the satisfaction of the ICs. Phase I yields an intermediate solution that accounts for the CCs alone and relies on a hybrid approach based on CC types. For one type, the problem is modeled as an Integer Linear Program; for the others, we describe an efficient and accurate solution. We then combine the two solutions. Phase II augments this solution by incorporating the ICs and uses a coloring of the conflict hypergraph to infer the values of the $FK$ column. Our extensive experimental study shows that our solution scales well as the data size and the number of constraints increase. We further show that our solution maintains low error rates for the CCs.
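To make the Phase II idea concrete, the following is a minimal sketch of coloring a conflict hypergraph, where each color stands for a candidate $FK$ value. It is not the algorithm from the paper, whose details the abstract does not give; it is a generic greedy heuristic under stated assumptions. The function name `greedy_fk_coloring`, the representation of a conflict as a set of tuple identifiers that must not all receive the same $FK$ value, and the fixed processing order are all hypothetical choices made for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Sequence


def greedy_fk_coloring(
    tuples: Sequence[Hashable],
    conflicts: Iterable[Iterable[Hashable]],
    fk_values: Sequence[Hashable],
) -> Dict[Hashable, Hashable]:
    """Assign an FK value to every tuple so that no conflict set is monochromatic.

    Assumption (illustrative, not from the paper): each conflict is a set of
    two or more tuple ids that would violate an IC if they all shared one FK
    value. FK values are assumed never to be None.
    """
    # Index the hyperedges by the tuples they contain.
    edges_of: Dict[Hashable, List[frozenset]] = {t: [] for t in tuples}
    for raw_edge in conflicts:
        edge = frozenset(raw_edge)
        if len(edge) < 2:
            raise ValueError("a conflict must involve at least two tuples")
        for t in edge:
            edges_of[t].append(edge)

    assignment: Dict[Hashable, Hashable] = {}
    for t in tuples:
        for fk in fk_values:
            # fk clashes if some hyperedge containing t already has every
            # other member assigned to fk (uncolored members never match).
            clash = any(
                all(assignment.get(u) == fk for u in edge if u != t)
                for edge in edges_of[t]
            )
            if not clash:
                assignment[t] = fk
                break
        else:
            # Greedy order may fail even when a valid coloring exists.
            raise ValueError(f"no conflict-free FK value left for tuple {t!r}")
    return assignment


if __name__ == "__main__":
    # Tuples a and b may not share an FK; a, b, c may not all share one.
    print(greedy_fk_coloring(["a", "b", "c"], [{"a", "b"}, {"a", "b", "c"}], [1, 2]))
```

Because the last tuple colored in any hyperedge checks that edge before taking a value, every returned assignment leaves no conflict set monochromatic. The sketch ignores the CCs entirely, so it only illustrates why a proper coloring yields IC-satisfying $FK$ values, not how the two phases are combined.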