Semantic Preprocessing for LLM-based Malware Analysis

In malware analysis, many approaches rely on Artificial Intelligence to handle large volumes of data. However, these techniques focus on a data view of the files (images, sequences) rather than on an expert's view. To address this issue, we propose a new preprocessing method that focuses on expert knowledge to improve malware semantic analysis and result interpretability. The method creates JSON reports for Portable Executable files. These reports gather features from both static and behavioral analysis, and incorporate packer signature detection as well as MITRE ATT&CK and Malware Behavior Catalog (MBC) knowledge. The purpose of this preprocessing is to build a semantic representation of binary files that is understandable by malware analysts and that can enhance the explainability of AI models for malicious file analysis. Using this preprocessing to train a Large Language Model for malware classification, we achieve a weighted-average F1-score of 0.94 on a complex dataset that is representative of market reality.
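As an illustration only, the kind of JSON report described above could be assembled along the following lines. This is a minimal sketch and not the authors' implementation: the `PEReport` class, the `build_report` function, every field name, and all sample values are assumptions made for this example, and the feature extraction itself (static parsing, sandbox execution, signature matching) is assumed to have already happened elsewhere.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PEReport:
    """Hypothetical report layout; all field names are illustrative assumptions."""
    sha256: str
    static: dict = field(default_factory=dict)    # static features, e.g. sections, imports
    behavior: dict = field(default_factory=dict)  # behavioral features, e.g. observed API calls
    packers: list = field(default_factory=list)   # packer signature matches
    attack: list = field(default_factory=list)    # MITRE ATT&CK technique identifiers
    mbc: list = field(default_factory=list)       # Malware Behavior Catalog entries


def build_report(sha256, static, behavior, packers, attack, mbc):
    """Merge already-extracted features into one JSON document.

    List-valued fields are de-duplicated and sorted so that the same
    inputs always serialize to the same text.
    """
    report = PEReport(
        sha256=sha256,
        static=static,
        behavior=behavior,
        packers=sorted(set(packers)),
        attack=sorted(set(attack)),
        mbc=sorted(set(mbc)),
    )
    return json.dumps(asdict(report), indent=2, sort_keys=True)


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    print(build_report(
        sha256="0" * 64,
        static={"sections": [".text", ".rsrc"], "imports": ["kernel32.dll"]},
        behavior={"api_calls": ["CreateRemoteThread", "WriteProcessMemory"]},
        packers=["UPX", "UPX"],
        attack=["T1055", "T1027.002"],
        mbc=["placeholder-mbc-entry"],
    ))
```

Keeping the report as plain, keyed JSON means the same text can be read by an analyst and passed to a language model as input, which is the dual purpose the abstract describes.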
