The scarcity of high-quality data hinders malware detection and limits the performance of machine learning models. We introduce MalDataGen, an open-source modular framework that generates high-fidelity synthetic tabular data using deep generative models (e.g., WGAN-GP, VQ-VAE). We evaluate it with dual validation (TR-TS/TS-TR), seven classifiers, and utility metrics; MalDataGen outperforms benchmarks such as SDV while preserving data utility. Its flexible design allows integration into detection pipelines, offering a practical solution for cybersecurity applications.
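The dual validation named in the abstract can be illustrated with a minimal sketch. It assumes that TR-TS means training on real data and testing on synthetic data, and that TS-TR means the reverse. Everything in the block is hypothetical and for illustration only: the toy data, the nearest-centroid classifier, and the function names are not MalDataGen's API, and the "synthetic" set is a second sample from the same toy distribution rather than the output of WGAN-GP or VQ-VAE.

```python
import random


def make_data(rng, n, shift):
    """Toy two-class tabular data: class 1 features are centred at `shift`."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(label * shift, 1.0) for _ in range(4)])
        y.append(label)
    return X, y


def fit_centroids(X, y):
    """Nearest-centroid classifier (an assumption): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def accuracy(centroids, X, y):
    hits = sum(predict(centroids, x) == t for x, t in zip(X, y))
    return hits / len(y)


def dual_validation(real, synthetic):
    """Score both directions: train on one set, test on the other."""
    X_r, y_r = real
    X_s, y_s = synthetic
    return {
        "TR-TS": accuracy(fit_centroids(X_r, y_r), X_s, y_s),
        "TS-TR": accuracy(fit_centroids(X_s, y_s), X_r, y_r),
    }


rng = random.Random(0)
real = make_data(rng, 200, shift=3.0)
# Stand-in for generator output (assumption): a fresh draw from the same toy distribution.
synthetic = make_data(rng, 200, shift=3.0)
scores = dual_validation(real, synthetic)
print(scores)
```

If the synthetic data captures the class structure of the real data, both scores should be close to what a classifier achieves when trained and tested on real data alone. A large gap in either direction suggests the generator has lost or distorted information. In practice the same loop would be repeated for each classifier under study.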