Traditional synthetic data generation methods are model-based: they tune the parameters of a model rather than working from the structure of the data itself. In contrast, Scagnostics is an exploratory graphical method that captures the structure of bivariate data using graph-theoretic measures. This paper presents a novel data generation method, scatteR, that uses Scagnostics measurements to control the characteristics of the generated dataset. Using an iterative Generalized Simulated Annealing optimizer, scatteR searches for the arrangement of data points that minimizes the distance between the current and target Scagnostics measurements. The results demonstrate that scatteR can generate 50 data points in under 30 seconds with an average Root Mean Squared Error of 0.05, which makes it a useful pedagogical tool for teaching statistical methods. Overall, scatteR provides an entry point for generating datasets based on the characteristics of the instance space, rather than relying on model-based simulations.
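The optimization loop described in the abstract can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the scatteR implementation: it uses SciPy's `dual_annealing` (a generalized simulated annealing routine) in place of the paper's optimizer, and only two hand-rolled, uncorrected measures in the spirit of Scagnostics (Monotonic as the squared Spearman correlation, Skewed from minimum-spanning-tree edge lengths) rather than the full measure set. The function names `monotonic`, `skewed`, `measures`, and `generate` are hypothetical.

```python
import numpy as np
from scipy.optimize import dual_annealing
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from scipy.stats import rankdata


def monotonic(points):
    """Squared Spearman rank correlation of x and y, in [0, 1]."""
    rx, ry = rankdata(points[:, 0]), rankdata(points[:, 1])
    if rx.std() == 0 or ry.std() == 0:
        return 0.0  # degenerate case: one coordinate is constant
    return float(min(np.corrcoef(rx, ry)[0, 1] ** 2, 1.0))


def skewed(points):
    """Skewness of MST edge lengths: (q90 - q50) / (q90 - q10)."""
    mst = minimum_spanning_tree(squareform(pdist(points)))
    if mst.data.size == 0:
        return 0.0  # degenerate case: all points coincide
    q10, q50, q90 = np.quantile(mst.data, [0.1, 0.5, 0.9])
    return float((q90 - q50) / (q90 - q10)) if q90 > q10 else 0.0


def measures(points):
    """Vector of stand-in measures for an (n, 2) array of points."""
    return np.array([monotonic(points), skewed(points)])


def generate(target, n_points=20, seed=0, maxiter=50):
    """Search for points in the unit square whose measures approach `target`.

    Returns the points, the RMSE of the random starting layout, and the
    RMSE of the best layout found.
    """
    rng = np.random.default_rng(seed)
    x0 = rng.uniform(0.0, 1.0, size=2 * n_points)  # random starting layout

    def rmse(flat):
        diff = measures(flat.reshape(n_points, 2)) - target
        return float(np.sqrt(np.mean(diff ** 2)))

    result = dual_annealing(
        rmse,
        bounds=[(0.0, 1.0)] * (2 * n_points),
        x0=x0,
        maxiter=maxiter,
        no_local_search=True,  # annealing moves only, no local refinement
        seed=seed,
    )
    return result.x.reshape(n_points, 2), rmse(x0), result.fun
```

For example, `generate(np.array([0.8, 0.6]))` returns a 20-point layout together with the starting and final RMSE against the target vector. The small `maxiter` keeps the sketch fast; a closer match to the target would generally need more iterations.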