Malware developers combine techniques such as compression, encryption, and obfuscation to bypass anti-virus software. Malware that employs these anti-analysis techniques can evade AI-based anti-virus software and malware analysis tools, so classifying packed files is a major challenge. Problems arise when a malware classifier learns the features of the packer rather than those of the malware itself. Training models on such unintentionally erroneous data can amount to poisoning, adversarial, and evasion attacks. Researchers should therefore take packing into account when building malware classifier models. In this paper, we propose a multi-step framework for classifying and identifying packed samples, consisting of pseudo-optimal feature selection, machine learning-based classification, and packer identification steps. In the first step, we use the CART algorithm and permutation importance to preselect 20 important features. In the second step, each model learns the 20 preselected features in order to classify packed files with the highest possible performance. As a result, XGBoost trained on the features preselected by XGBoost with permutation importance outperformed every other experimental scenario, with an accuracy of 99.67%, an F1-score of 99.46%, and an area under the curve (AUC) of 99.98%. In the third step, we propose a new approach that identifies packers only for samples classified as Well-Known Packed.
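As a rough illustration of the first step, the sketch below ranks features by permutation importance (the mean drop in accuracy when one feature column is shuffled) and keeps the top k. It is a minimal, standard-library-only sketch and not the authors' implementation. The feature names, the synthetic data, and the toy threshold rule standing in for a trained CART or XGBoost model are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence

Row = Sequence[float]
Model = Callable[[Row], int]


def accuracy(predict: Model, X: List[Row], y: List[int]) -> float:
    """Fraction of rows whose predicted label matches the true label."""
    return sum(1 for row, label in zip(X, y) if predict(row) == label) / len(y)


def permutation_importance(predict: Model, X: List[Row], y: List[int],
                           n_repeats: int = 5, seed: int = 0) -> List[float]:
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data set with only column j permuted.
            X_perm = [list(row[:j]) + [v] + list(row[j + 1:])
                      for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def select_top_k(importances: List[float], k: int) -> List[int]:
    """Indices of the k features with the largest importance."""
    ranked = sorted(range(len(importances)),
                    key=lambda j: importances[j], reverse=True)
    return ranked[:k]


# Hypothetical feature names; the paper's actual feature set is not given here.
FEATURES = ["section_entropy", "import_count", "file_size", "section_count"]

# Synthetic data (assumption): only the first feature determines the label.
data_rng = random.Random(42)
X = [[data_rng.random() for _ in FEATURES] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]


def toy_model(row: Row) -> int:
    """Stand-in for a trained classifier: flags high-entropy samples as packed."""
    return 1 if row[0] > 0.5 else 0


importances = permutation_importance(toy_model, X, y)
top_features = select_top_k(importances, k=1)
print([FEATURES[j] for j in top_features])
```

With a real data set, the toy rule would be replaced by a fitted model and k set to 20; library routines such as scikit-learn's `sklearn.inspection.permutation_importance` provide the same measurement for fitted estimators.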