File fragment classification (FFC) on small chunks of memory is essential in memory forensics and Internet security. Existing methods mainly treat file fragments as 1d byte signals and rely on the captured inter-byte features for classification, while the bit information within bytes, i.e., intra-byte information, is seldom considered. This is inherently ill-suited to classifying variable-length coding files, whose symbols are represented by a variable number of bits. To address this, we propose Byte2Image, a novel data augmentation technique that introduces the neglected intra-byte information into file fragments and re-treats them as 2d gray-scale images, allowing both inter-byte and intra-byte correlations to be captured simultaneously by powerful convolutional neural networks (CNNs). Specifically, to convert a file fragment into a 2d image, we employ a sliding byte window to expose the neglected intra-byte information and stack its n-gram features row by row. We further propose a byte sequence \& image fusion network as a classifier, which jointly models the raw 1d byte sequence and the converted 2d image to perform FFC. Experiments on the FFT-75 dataset validate that our method achieves notable accuracy improvements over state-of-the-art methods in nearly all scenarios. The code will be released at https://github.com/wenyang001/Byte2Image.
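The Byte2Image conversion can be illustrated with a minimal sketch of one plausible reading of the abstract: a stride-1 sliding window of n bytes forms an n-gram, each byte in the window is unpacked into its 8 bits to expose intra-byte information, and the resulting bit rows are stacked into a 2d gray-scale image. The function name byte2image, the window size n, the stride of 1, and the MSB-first bit order are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def byte2image(fragment: bytes, n: int = 8) -> np.ndarray:
    """Convert a 1d byte fragment into a 2d bit image.

    Each row holds the unpacked bits of one sliding n-byte window
    (an n-gram), so the image has shape (L - n + 1, 8 * n) with
    L = len(fragment). Illustrative sketch only, not the paper's
    reference implementation.
    """
    buf = np.frombuffer(fragment, dtype=np.uint8)
    assert len(buf) >= n, "fragment shorter than the byte window"
    # Stride-1 sliding windows of n consecutive bytes (n-grams).
    windows = np.lib.stride_tricks.sliding_window_view(buf, n)
    # Expose intra-byte information: unpack every byte into its 8 bits
    # (most-significant bit first) and scale to a gray-scale range.
    bits = np.unpackbits(windows, axis=1)      # (L - n + 1, 8 * n)
    return (bits * 255).astype(np.uint8)       # 2d gray-scale image

# Usage: a 512-byte fragment becomes a (505, 64) gray-scale image.
img = byte2image(np.random.default_rng(0).bytes(512), n=8)
print(img.shape)
```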