Software vulnerabilities are usually caused by design flaws or implementation errors that can be exploited to compromise the security of a system. At present, the most commonly used method for detecting software vulnerabilities is static analysis. Most related techniques operate at the source code level, are based on rules or code similarity, and rely on manually defined vulnerability features. However, these rules and features are difficult to define and design accurately, so static analysis faces many challenges in practical applications. To alleviate this problem, some researchers have proposed using neural networks, which can extract features automatically, to make detection more intelligent. However, there are many types of neural networks, and the choice of data preprocessing method has a significant impact on model performance. Choosing a suitable neural network and data preprocessing method for a given problem is therefore a major challenge for engineers and researchers. To address this, we conducted extensive experiments to evaluate the two most typical neural networks (i.e., Bi-LSTM and RVFL) combined with the two most classical data preprocessing methods (i.e., vector representation and program symbolization) on software vulnerability detection problems, and obtained a series of findings that can provide valuable guidelines for researchers and engineers. Specifically, we found that 1) RVFL always trains faster than Bi-LSTM, but Bi-LSTM achieves higher prediction accuracy than RVFL; 2) using doc2vec for vector representation gives the model faster training and better generalization than using word2vec; and 3) multi-level symbolization helps improve the precision of neural network models.
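
To illustrate why an RVFL network can train faster than a recurrent model such as Bi-LSTM, the following is a minimal sketch of a generic RVFL classifier, not the implementation evaluated in this work. It assumes that each code sample has already been converted into a fixed-length feature vector (for example by doc2vec); the function names `train_rvfl` and `predict_rvfl`, the `tanh` activation, the hidden-layer size, and the ridge penalty are all illustrative assumptions. The hidden weights are drawn at random and never updated, so training reduces to a single regularized least-squares solve instead of iterative backpropagation.

```python
import numpy as np


def train_rvfl(X, y, n_hidden=64, ridge=1e-2, seed=0):
    """Fit a generic RVFL model (illustrative sketch, hypothetical names).

    X: (n_samples, n_features) feature vectors, e.g. document embeddings.
    y: (n_samples,) labels in {0, 1} (1 = vulnerable is an assumed encoding).
    """
    rng = np.random.default_rng(seed)
    # Random hidden-layer parameters; they stay fixed after initialization.
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = np.tanh(X @ W + b)
    # Direct links: the raw inputs are concatenated with the hidden features.
    D = np.hstack([X, H])
    # Only the output weights are learned, via closed-form ridge regression.
    A = D.T @ D + ridge * np.eye(D.shape[1])
    beta = np.linalg.solve(A, D.T @ y)
    return W, b, beta


def predict_rvfl(model, X):
    """Return 0/1 predictions by thresholding the regression output at 0.5."""
    W, b, beta = model
    D = np.hstack([X, np.tanh(X @ W + b)])
    return (D @ beta > 0.5).astype(int)
```

Because the only learned parameters come from one linear solve, the training cost is dominated by a single matrix factorization, whereas a Bi-LSTM requires many gradient passes over the token sequences. This is consistent with the speed and accuracy trade-off reported above, although the sketch itself makes no claim about accuracy on real vulnerability data.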