Given the rising number of cyber attacks on the Internet, flow-based data sets are crucial for improving the performance of the Machine Learning (ML) components that run in network-based intrusion detection systems (IDS). To overcome the shortage of network traffic data for attack analysis, recent works propose Generative Adversarial Networks (GANs) for generating synthetic flow-based network traffic. Data privacy is increasingly a strong requirement when processing such network data, which calls for solutions in which synthetic data can fully replace real data. Because GAN training converges poorly, none of the existing solutions can generate fully synthetic data of high enough quality to substitute for real data in the training of IDS ML components. Instead, they mix real and synthetic data, with the synthetic data acting only as data augmentation, which leads to privacy breaches because real data is still used. In sharp contrast, in this work we propose a novel deterministic way to measure the quality of the synthetic data produced by a GAN, both with respect to the real data and with respect to its performance when used for ML tasks. As a byproduct, we present a heuristic that uses these metrics to select the best-performing generator during GAN training, which yields a stopping criterion. An additional heuristic is proposed to select the best-performing GANs when different types of synthetic data are to be used in the same ML task. We demonstrate the adequacy of our proposal by generating synthetic flow-based data for cryptomining attack traffic and normal traffic using an enhanced version of a Wasserstein GAN. We show that the generated synthetic network traffic can completely replace real data when training an ML-based cryptomining detector, obtaining similar performance and avoiding privacy violations, since no real data is used to train the detector.
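The generator-selection heuristic and its stopping criterion can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the metrics, so the sketch assumes each saved generator checkpoint has already been scored with a hypothetical distance to the real data (lower is better) and with the F1 score of a detector trained only on synthetic data and tested on real data. The names `Checkpoint`, `select_best`, `should_stop`, and the `patience` parameter are all illustrative, and the scores are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Checkpoint:
    epoch: int
    distance_to_real: float  # assumed similarity metric, lower is better
    f1_on_real: float        # assumed F1 of a detector trained on synthetic data only


def select_best(checkpoints: List[Checkpoint]) -> Optional[Checkpoint]:
    """Return the checkpoint with the highest F1; ties go to the smaller distance."""
    if not checkpoints:
        return None
    return max(checkpoints, key=lambda c: (c.f1_on_real, -c.distance_to_real))


def should_stop(checkpoints: List[Checkpoint], patience: int = 3) -> bool:
    """Stop once the best checkpoint is at least `patience` epochs old."""
    best = select_best(checkpoints)
    if best is None:
        return False
    return checkpoints[-1].epoch - best.epoch >= patience


# Made-up scores for six checkpoints, purely for illustration.
history = [
    Checkpoint(1, 0.90, 0.55),
    Checkpoint(2, 0.40, 0.81),
    Checkpoint(3, 0.30, 0.93),
    Checkpoint(4, 0.35, 0.93),
    Checkpoint(5, 0.50, 0.90),
    Checkpoint(6, 0.60, 0.88),
]

best = select_best(history)
print(best.epoch, should_stop(history, patience=3))
```

Ranking by task performance first and by distance second is only one possible design choice; a weighted combination of the two metrics would fit the same interface.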