In this paper, we explore the use of metric learning to embed Windows PE files in a low-dimensional vector space for downstream use in a variety of applications, including malware detection, family classification, and malware attribute tagging. Specifically, we enrich the labels on malicious and benign PE files with malicious capabilities extracted through computationally expensive disassembly. Using these capabilities, we derive several types of metric embeddings from an embedding neural network trained with contrastive loss, Spearman rank correlation, and combinations thereof. We then evaluate performance on a variety of transfer tasks on the EMBER and SOREL datasets, demonstrating that, for several tasks, low-dimensional, computationally efficient metric embeddings maintain performance with little decay. This offers the potential to retrain quickly for a variety of transfer tasks at significantly reduced storage overhead. We conclude by examining practical considerations for the use of our proposed embedding approach, such as robustness to adversarial evasion and the introduction of task-specific auxiliary objectives to improve performance on mission-critical tasks.
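To make the contrastive objective named in the abstract concrete, the following is a minimal sketch of the classic pairwise contrastive loss, not the implementation used in this work. The function names, the `margin` default, and the pairing rule in `share_capability` (two files count as similar if they share at least one capability tag) are illustrative assumptions; the abstract does not specify how pairs are formed or which loss variant is used.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, similar, margin=1.0):
    """Classic pairwise contrastive loss (assumed variant, for illustration).

    A similar pair is penalized by its squared distance, pulling it together.
    A dissimilar pair is penalized only while it is closer than `margin`.
    """
    d = euclidean(u, v)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2


def share_capability(caps_a, caps_b):
    """Hypothetical pairing rule: similar if any capability tag is shared."""
    return bool(set(caps_a) & set(caps_b))
```

Under these assumptions, a pair of files would be scored with `contrastive_loss(emb_a, emb_b, share_capability(caps_a, caps_b))`, where `emb_a` and `emb_b` are the network's embeddings of the two files. A rank-correlation objective would instead compare the ordering of embedding distances with the ordering of capability similarities.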