As digital devices proliferate in everyday life, the cyber-attack surface has grown, and adversaries continuously explore new avenues to exploit these devices and deploy malware. To counter this, detection approaches typically employ hashing-based algorithms such as SSDeep, TLSH, and IMPHash to capture structural and behavioural similarities among binaries. This work analyses and evaluates these techniques for clustering malware samples with the K-means algorithm. Specifically, we experimented with established malware families and traits and found that TLSH and IMPHash produce more distinct, semantically meaningful clusters, whereas SSDeep is more efficient for broader classification tasks. These findings can guide the development of more robust threat-detection and adaptive security mechanisms.
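To make the idea of clustering samples by their hash digests concrete, the following is a minimal sketch and not the pipeline evaluated in this work. It assumes, purely for illustration, that each sample is represented by a fixed-length hex-encoded digest whose bytes are treated as numeric features; the abstract does not state how digests are turned into features. The helper names (`digest_to_vector`, `kmeans`) and the toy digests are hypothetical, and real SSDeep or TLSH digests would normally be compared with their own similarity or distance functions rather than Euclidean distance.

```python
import random


def digest_to_vector(hex_digest):
    """Turn a hex string into a list of byte values (illustrative assumption)."""
    return [int(hex_digest[i:i + 2], 16) for i in range(0, len(hex_digest), 2)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain K-means (Lloyd's algorithm); returns one cluster label per vector."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy digests standing in for two hypothetical malware families.
digests = [
    "00010203", "01010203", "00020203",  # family A (made-up values)
    "f0f1f2f3", "f1f1f2f3", "f0f2f2f3",  # family B (made-up values)
]
vectors = [digest_to_vector(d) for d in digests]
labels = kmeans(vectors, k=2)
print(labels)
```

On well-separated toy data like this, the first three digests share one label and the last three share the other; which group receives label 0 depends on the random initialisation.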