Synthetic data generation is widely used in software testing, data privacy, imbalanced learning, and the explanation of artificial intelligence models. In all of these contexts, it is crucial to generate plausible data samples. Widely used generation approaches commonly assume that the features are independent. In practice, however, the variables of a dataset typically depend on one another, and ignoring these dependencies during generation leads to implausible records. The main obstacle is that the dependencies among variables are typically unknown. In this paper, we design a synthetic dataset generator for tabular data that discovers nonlinear causal relationships among the variables and uses them at generation time. State-of-the-art methods for nonlinear causal discovery are typically inefficient. We speed them up by restricting causal discovery to the features that appear in frequent patterns, which a pattern mining algorithm retrieves efficiently. To validate our proposal, we also design a framework for generating synthetic datasets with known causal relationships. Extensive experiments on many synthetic and real datasets with known causal relationships show the effectiveness of the proposed method.
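The pruning idea in the abstract, running causal discovery only on features that co-occur in frequent patterns, can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `frequent_feature_pairs`, the brute-force counting of size-2 itemsets (a stand-in for a real pattern miner), the `min_support` threshold, and the toy records are all assumptions made for illustration. The sketch further assumes that the features have already been discretized.

```python
from collections import Counter
from itertools import combinations


def frequent_feature_pairs(rows, min_support):
    """Return the feature pairs that co-occur in a frequent 2-itemset.

    rows: list of dicts mapping feature name -> discretized value.
    min_support: minimum fraction of rows in which a pair of
                 (feature, value) items must appear together.
    Hypothetical helper: brute-force counting stands in for a real miner.
    """
    n = len(rows)
    counts = Counter()
    for row in rows:
        # Each item is a (feature, value) tuple; features in a row are distinct.
        items = sorted(row.items())
        for a, b in combinations(items, 2):
            counts[(a, b)] += 1

    pairs = set()
    for ((feat_a, _), (feat_b, _)), count in counts.items():
        if count / n >= min_support:
            pairs.add(frozenset((feat_a, feat_b)))
    return pairs


# Toy records (illustrative only): age and income move together, city does not.
rows = [
    {"age": "young", "income": "low", "city": "A"},
    {"age": "young", "income": "low", "city": "B"},
    {"age": "old", "income": "high", "city": "C"},
    {"age": "old", "income": "high", "city": "D"},
]

all_pairs = {frozenset(p) for p in combinations(rows[0].keys(), 2)}
candidates = frequent_feature_pairs(rows, min_support=0.5)

# A pairwise causal discovery routine would then be run on `candidates` only.
print(f"testing {len(candidates)} of {len(all_pairs)} feature pairs")
```

In this toy case only the pair (age, income) survives, so an expensive nonlinear dependency test would be run once instead of three times. With many features the number of candidate pairs grows quadratically, which is where such pruning would be expected to matter.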