Modern synthetic data generators are model-based methods in which the focus is on tuning the parameters of the model rather than on specifying the structure of the data itself. Scagnostics (scatterplot diagnostics) is an exploratory graphical method that captures the structure of bivariate data through graph-theoretic measures. An inverse scagnostic procedure would therefore provide an entry point for generating datasets from the characteristics of the instance space instead of from a model-based simulation. scatteR is a novel data generation method with controllable characteristics based on scagnostic measurements. It uses a Generalized Simulated Annealing optimizer that iteratively searches for the arrangement of data points minimizing the distance between the current and target scagnostic measurements. As a pedagogical tool, scatteR can be used to generate datasets for teaching statistical methods. In the results of this study, scatteR generates 50 data points in under 30 seconds with an average Root Mean Squared Error of 0.05.
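The optimization loop described in the abstract can be illustrated with a minimal sketch. It is not the scatteR implementation and rests on several assumptions: it uses plain simulated annealing with a linear cooling schedule instead of Generalized Simulated Annealing, it targets a single measure instead of the full scagnostic set, and it uses the squared Spearman rank correlation as a stand-in for the Monotonic scagnostic (which is commonly defined that way). The names `monotonic`, `rmse`, and `generate` are hypothetical and chosen only for this example.

```python
import math
import random


def ranks(values):
    """Return the 0-based rank of each value (ties are not expected here)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def monotonic(xs, ys):
    """Squared Spearman rank correlation, a stand-in for one scagnostic measure."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    rho = 1 - 6 * d2 / (n * (n * n - 1))
    return rho * rho


def rmse(current, target):
    """Root mean squared error between two equally long measurement vectors."""
    return math.sqrt(sum((c - t) ** 2 for c, t in zip(current, target)) / len(target))


def generate(target, n_points=30, n_steps=5000, seed=0):
    """Search for points in the unit square whose measures are close to `target`.

    Plain simulated annealing (an assumption of this sketch): move one point's
    y-coordinate at a time, always accept improvements, and accept worse moves
    with a probability that shrinks as the temperature cools.
    """
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_points)]
    ys = [rng.random() for _ in range(n_points)]
    err = rmse([monotonic(xs, ys)], target)
    best_err, best_ys = err, ys[:]

    for step in range(n_steps):
        temperature = 0.05 * (1 - step / n_steps) + 1e-9  # linear cooling
        i = rng.randrange(n_points)
        old_y = ys[i]
        ys[i] = rng.random()  # propose a new position for one point
        new_err = rmse([monotonic(xs, ys)], target)
        accept = new_err <= err or rng.random() < math.exp((err - new_err) / temperature)
        if accept:
            err = new_err
            if err < best_err:
                best_err, best_ys = err, ys[:]
        else:
            ys[i] = old_y  # reject the move and restore the point

    return xs, best_ys, best_err


if __name__ == "__main__":
    xs, ys, err = generate(target=[0.8])
    print(f"achieved monotonic = {monotonic(xs, ys):.3f}, RMSE = {err:.4f}")
```

Extending the sketch to several measures would mean returning a longer measurement vector and passing a target of the same length to `rmse`; the acceptance rule itself stays the same.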