Malfustection: Obfuscated Malware Detection and Malware Classification with Data Shortage by Combining Semi-Supervised and Contrastive Learning

With the advent of new technologies, digital devices of many kinds have become widespread. In today's world, where everyday tasks can hardly be carried out without technology, this extensive use of computers paves the way for malicious activity, so it is important to provide defenses against these threats. Malware is one of the best-known and most widely used tools with which attackers carry out destructive activities. Producing malware from scratch is relatively difficult, so attackers tend to obfuscate existing malware until it becomes an unrecognizable program. Because deriving new malware from old through obfuscation is a creative task, obfuscated malware is hard to identify. In this research, we propose a solution to this problem: the code is first converted to an image, and a semi-supervised approach combined with contrastive learning is then applied. In this setting, an obfuscation of the malware bytecode corresponds to an augmentation of the image. Hence, by using meaningful augmentations that simulate individual obfuscation changes, and by combining them to generate complex obfuscation procedures, the proposed solution can construct, learn, and detect a wide range of obfuscations. This work addresses two issues: 1) malware classification despite data deficiency and 2) obfuscated malware detection by training on non-obfuscated malware. According to the results, the proposed method overcomes the data shortage problem in malware classification, reaching an accuracy of 90.1% when just 10% of the data is used to train the model. Moreover, training on basic malware without obfuscation achieved 96.21% accuracy in detecting obfuscated malware.
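
The idea that an obfuscation of the bytecode corresponds to an augmentation of the image can be illustrated with a minimal sketch. This is not the authors' implementation: the image width, the function names, and the two augmentations (junk-row insertion as a stand-in for dead-code insertion, block swapping as a stand-in for code reordering) are all assumptions chosen for illustration. The two augmented views it returns are the kind of positive pair a contrastive objective would pull together.

```python
import random


def bytes_to_image(data: bytes, width: int = 16) -> list:
    """Lay raw bytes out row by row as a grayscale image (one byte per pixel)."""
    padded = data + bytes(-len(data) % width)  # zero-pad the final row
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def insert_junk_rows(image: list, rng: random.Random, count: int = 2) -> list:
    """Hypothetical augmentation: insert constant rows at random positions,
    loosely mimicking dead-code insertion."""
    if not image:
        return []
    out = [row[:] for row in image]
    width = len(out[0])
    for _ in range(count):
        pos = rng.randrange(len(out) + 1)
        out.insert(pos, [0x90] * width)  # 0x90 is the x86 NOP opcode
    return out


def swap_row_blocks(image: list, rng: random.Random, block: int = 2) -> list:
    """Hypothetical augmentation: swap two equal-sized blocks of rows,
    loosely mimicking code reordering."""
    out = [row[:] for row in image]
    n_blocks = len(out) // block
    if n_blocks < 2:
        return out
    a, b = rng.sample(range(n_blocks), 2)
    sa, sb = a * block, b * block
    out[sa:sa + block], out[sb:sb + block] = out[sb:sb + block], out[sa:sa + block]
    return out


def two_views(image: list, seed: int = 0) -> tuple:
    """Build two differently augmented views of the same sample.
    The second view composes both augmentations into a more complex change."""
    rng = random.Random(seed)
    view_a = insert_junk_rows(image, rng)
    view_b = swap_row_blocks(insert_junk_rows(image, rng), rng)
    return view_a, view_b
```

For example, `two_views(bytes_to_image(open(path, "rb").read()))` (with `path` a placeholder for a binary file) would yield two views of one sample. An encoder trained contrastively would then be encouraged to map both views to nearby embeddings, which is the intuition behind detecting obfuscated variants after training only on non-obfuscated samples.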
