Malicious email attachments are a growing delivery vector for malware. While machine learning has been successfully applied to portable executable (PE) malware detection, we ask whether similar approaches can be extended to detect malware across the heterogeneous file types commonly found in email attachments. In this paper, we explore the feasibility of applying machine learning as a static countermeasure to detect several types of malicious email attachments, including Microsoft Office documents and Zip archives. To this end, we collected a dataset of over 5 million malicious and benign Microsoft Office documents from VirusTotal for evaluation, as well as a dataset of benign Microsoft Office documents from the Common Crawl corpus, which we use to provide more realistic estimates of thresholds for false positive rates on in-the-wild data. We also collected a dataset of approximately 500k malicious and benign Zip archives, scraped using the VirusTotal service, on which we performed a separate evaluation. We analyze the predictive performance of several classifiers on each of the VirusTotal datasets using a 70/30 train/test split on first-seen time, evaluating feature and classifier types that have been applied successfully in commercial antimalware products and R&D contexts. Using deep neural networks and gradient boosted decision trees, we obtain ROC curves with AUC > 0.99 on both the Microsoft Office document and Zip archive datasets. We also discuss deployment viability in various antimalware contexts.
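The evaluation protocol summarized above (a 70/30 train/test split ordered by first-seen time, scored by ROC AUC) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_by_first_seen` and `roc_auc`, the `(first_seen, label)` tuple layout, and the toy data are assumptions made for illustration. The AUC is computed through its standard equivalence with the Mann-Whitney rank statistic rather than by tracing the ROC curve.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample layout: (first_seen_timestamp, label), label 1 = malicious.
Sample = Tuple[int, int]


def split_by_first_seen(
    samples: Sequence[Sample], train_fraction: float = 0.7
) -> Tuple[List[Sample], List[Sample]]:
    """Order samples by first-seen time; the oldest fraction becomes training data."""
    ordered = sorted(samples, key=lambda s: s[0])
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via average ranks (Mann-Whitney U); tied scores share a rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores pairs[i:j]; it occupies ranks i+1 .. j.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j

    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy data: ten samples with first-seen times 1..10 in shuffled order.
toy = [(5, 0), (1, 1), (9, 0), (3, 0), (7, 1), (2, 0), (10, 1), (4, 1), (8, 0), (6, 1)]
train, test = split_by_first_seen(toy)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Splitting on time rather than at random keeps every test sample newer than every training sample, which approximates how a deployed detector encounters files it has not seen before.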