Data simulation is fundamental for machine learning and causal inference, as it allows exploration of scenarios and assessment of methods in settings with full control of the ground truth. Directed acyclic graphs (DAGs) are well established for encoding the dependence structure over a collection of variables in both inference and simulation settings. However, while modern machine learning is applied to increasingly complex data, DAG-based simulation frameworks are still confined to settings with relatively simple variable types and functional forms. Here we present DagSim, a Python-based framework for DAG-based data simulation without any constraints on variable types or functional relations. A succinct YAML format for defining the simulation model structure promotes transparency, while separate user-provided functions for generating each variable from its parents keep the simulation code modular. We illustrate the capabilities of DagSim through use cases where metadata variables control shapes in an image and patterns in bio-sequences.
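To make the idea of DAG-based simulation with one user-provided function per variable concrete, the following is a minimal sketch using only the Python standard library. It is not DagSim's API: the names `simulate`, `NODES`, `draw_shape`, `draw_intensity`, and `draw_image` are hypothetical and chosen for illustration. Each node lists its parents and a function that generates its value from the parents' values, and nodes are evaluated in topological order, so a metadata variable (`shape`) can control the content of a small image.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def simulate(nodes, n_samples, seed=0):
    """Draw samples from a DAG given as {name: (parents, function)}.

    Each function receives a random generator followed by the values of
    its parents, in the order in which the parents are listed.
    """
    rng = random.Random(seed)
    # Map every node to its set of predecessors (its parents).
    graph = {name: set(parents) for name, (parents, _) in nodes.items()}
    # static_order() yields parents before children and raises
    # graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(graph).static_order())
    samples = []
    for _ in range(n_samples):
        values = {}
        for name in order:
            parents, func = nodes[name]
            values[name] = func(rng, *(values[p] for p in parents))
        samples.append(values)
    return samples


# Hypothetical node functions: no restriction on the returned type.
def draw_shape(rng):
    return rng.choice(["square", "cross"])


def draw_intensity(rng):
    return rng.randint(1, 9)


def draw_image(rng, shape, intensity):
    """Return a 5x5 nested-list image containing the requested shape."""
    img = [[0] * 5 for _ in range(5)]
    for r in range(5):
        for c in range(5):
            if shape == "cross":
                on = r == 2 or c == 2
            else:  # outline of a 3x3 square centred in the image
                on = (r in (1, 3) and 1 <= c <= 3) or (c in (1, 3) and 1 <= r <= 3)
            if on:
                img[r][c] = intensity
    return img


# The child is listed first on purpose: evaluation order comes from the graph.
NODES = {
    "image": (("shape", "intensity"), draw_image),
    "shape": ((), draw_shape),
    "intensity": ((), draw_intensity),
}

if __name__ == "__main__":
    for sample in simulate(NODES, n_samples=2, seed=1):
        print(sample["shape"], sample["intensity"])
        for row in sample["image"]:
            print("".join(str(v) for v in row))
```

In this sketch the structure (`NODES`) is kept apart from the generating functions, which mirrors the separation the abstract describes between a declarative model definition and per-variable code. A real framework would read the structure from a configuration file rather than from a Python dictionary.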