The goal of this paper is to describe a system for generating synthetic sequential data within the Synthetic Data Vault. To this end, we present the Sequential model currently in SDV, an end-to-end framework that builds a generative model for multi-sequence, real-world data. The framework includes a novel neural network-based machine learning model, the conditional probabilistic auto-regressive (CPAR) model. The overall system and the model are available in the open-source Synthetic Data Vault (SDV) library (https://github.com/sdv-dev/SDV), along with a variety of other models for different synthetic data needs. After building the Sequential SDV, we used it to generate synthetic data and compared its quality against that of CTGAN, an existing non-sequential model based on generative adversarial networks. To compare sequential synthetic data against its real counterpart, we developed a new metric called Multi-Sequence Aggregate Similarity (MSAS). Using this metric, we conclude that our Sequential SDV model learns higher-level patterns than non-sequential models without any trade-offs in synthetic data quality.