Machine learning algorithms are widely used in the area of malware detection. With the growth of sample amounts, training of classification algorithms becomes more and more expensive. In addition, training data sets may contain redundant or noisy instances. The problem to be solved is how to select representative instances from large training data sets without reducing the accuracy. This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF). The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process for each subset. Each subset consists of instances that have the same nearest enemy. As a result, the PIF algorithm is fast since subsets are processed independently of each other using parallel computation. We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples. The feature set was extracted using static analysis, and it includes metadata from the portable executable file format. Our experimental results demonstrate that the proposed instance selection algorithm reduces the size of a training data set significantly with the only slightly decreased accuracy. The PIF algorithm outperforms existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage.
翻译:机器学习算法在恶意软件检测领域被广泛使用。随着样本数量的增加,分类算法的培训越来越昂贵。此外,培训数据集可能包含冗余或吵闹的情况。需要解决的问题是如何从大型培训数据集中选择代表性实例而不降低准确性。这项工作提出了一个新的平行案例选择算法,称为平行筛选程序(PIF)。该算法的主要设想是将数据集分为非重叠的子集,涵盖整个数据集,并对每个子集采用过滤程序。每个子集都由最接近敌人的事例组成。因此,PIF算法是快速的,因为子集是使用平行计算独立处理的。我们将PIF算法与由50万个恶意和良性样本组成的大型数据集中的几种最新实例选择算法进行比较。特征集是用静态分析提取的,其中包括从便携式可执行文件格式中提取的元数据。我们的实验结果显示,拟议的实例选择算法大大缩小了所设定的培训数据的规模,只有略微降低准确性。PIF算法在平均百分比分类中使用了现有百分比方法。