Data augmentation has been rare in the cyber security domain because it is technically difficult to alter data in a manner that remains semantically consistent with the original. This shortfall is particularly onerous because training data is unusually hard to acquire: collecting benign and malicious samples runs into copyright restrictions, and institutions such as banks and governments receive targeted malware that will never exist in large quantities. We present MARVOLO, a binary mutator that programmatically grows malware (and benign) datasets in a manner that boosts the accuracy of ML-driven malware detectors. MARVOLO employs semantics-preserving code transformations that mimic the alterations that malware authors and defensive developers of benign software routinely make in practice, allowing us to generate meaningful augmented data. Crucially, because the transformations preserve semantics, MARVOLO can safely propagate labels from original to newly generated data samples without requiring expensive reverse engineering of binaries. Further, MARVOLO embeds several key optimizations that keep costs low for practitioners by maximizing the density of diverse data samples generated within a given time (or resource) budget. Experiments using wide-ranging commercial malware datasets and a recent ML-driven malware detector show that MARVOLO boosts accuracies by up to 5% while operating on only a small fraction (15%) of the potential input binaries.
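To make the idea of semantics-preserving augmentation with label propagation concrete, the following is a minimal sketch on a toy instruction set. It is not MARVOLO's implementation: the instruction format, the `run` interpreter, the single `insert_nop` mutation, and the `per_sample` and `max_tries` parameters are all assumptions chosen for illustration. Real binary mutation operates on machine code and uses a richer set of transformations.

```python
import random

# Hypothetical toy instruction set (not a real ISA):
#   ("mov", reg, const)   set a register to a constant
#   ("add", dst, src)     dst = dst + src
#   ("nop",)              do nothing


def run(program):
    """Interpret a toy program and return the final register state."""
    regs = {}
    for ins in program:
        if ins[0] == "mov":
            regs[ins[1]] = ins[2]
        elif ins[0] == "add":
            regs[ins[1]] = regs.get(ins[1], 0) + regs.get(ins[2], 0)
        elif ins[0] == "nop":
            pass
        else:
            raise ValueError(f"unknown instruction: {ins!r}")
    return regs


def insert_nop(program, rng):
    """Semantics-preserving mutation: insert a no-op at a random position."""
    pos = rng.randrange(len(program) + 1)
    return program[:pos] + [("nop",)] + program[pos:]


def augment(dataset, per_sample, max_tries, seed=0):
    """Generate up to `per_sample` distinct variants per (program, label) pair.

    Each variant inherits the label of its original, which is safe only
    because the mutation does not change program behavior. Duplicate
    variants are discarded so the budget is spent on diverse samples.
    """
    rng = random.Random(seed)
    augmented = []
    for program, label in dataset:
        seen = {tuple(program)}
        made = 0
        tries = 0
        while made < per_sample and tries < max_tries:
            tries += 1
            variant = insert_nop(program, rng)
            key = tuple(variant)
            if key in seen:
                continue  # skip duplicates of earlier variants
            seen.add(key)
            augmented.append((variant, label))
            made += 1
    return augmented
```

In this sketch, `max_tries` stands in for the time or resource budget, and the `seen` set is a simple way to avoid spending that budget on repeated variants. Running each variant through `run` and comparing the result with the original program's register state is one way to check that the mutation left behavior unchanged.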