The accumulation of corporate data in the cloud has attracted more enterprise applications to the cloud, creating data gravity. As a consequence, network traffic has become more cloud-centric. This increase in cloud-centric traffic poses new challenges in designing learning systems for streaming data because of class imbalance. The number of classes plays a vital role in the accuracy of the classifiers built from the data streams. In this paper, we present a vector quantization-based sampling method that substantially reduces the class imbalance in data streams. We demonstrate its effectiveness by conducting experiments on network traffic and anomaly datasets with commonly used ML model-building methods: a Multilayer Perceptron on a TensorFlow backend, Support Vector Machines, K-Nearest Neighbours, and Random Forests. We built models using parallel processing, batch processing, and random sample selection. We show that the accuracy of the classification models improves when the data streams are pre-processed with our method. We used both the out-of-the-box hyperparameters of these classifiers and auto-sklearn for hyperparameter optimization.
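The abstract does not spell out how the vector quantization-based sampling works, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that every over-represented class is compressed to a codebook learned with Lloyd's algorithm (k-means), with the codebook size set to the size of the smallest class. The function names `build_codebook` and `vq_undersample`, the choice of codebook size, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_codebook(points, k, iterations=20, seed=0):
    """Learn k codewords for `points` with Lloyd's algorithm (k-means).

    Assumption: a k-means codebook stands in for the paper's quantizer.
    """
    rng = random.Random(seed)
    # Initialise the codebook with k distinct samples from the data.
    codebook = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cell of its nearest codeword.
        cells = [[] for _ in codebook]
        for p in points:
            nearest = min(
                range(len(codebook)),
                key=lambda i: squared_distance(p, codebook[i]),
            )
            cells[nearest].append(p)
        # Update step: move each codeword to the mean of its cell.
        # An empty cell keeps its previous codeword.
        for i, cell in enumerate(cells):
            if cell:
                codebook[i] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook


def vq_undersample(X, y, seed=0):
    """Replace each over-represented class by a codebook of minority size."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    # Hypothetical choice: shrink every class to the smallest class size.
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        if len(rows) > target:
            rows = build_codebook(rows, target, seed=seed)
        X_out.extend(rows)
        y_out.extend([label] * len(rows))
    return X_out, y_out


if __name__ == "__main__":
    # Toy imbalanced data: 200 "normal" points and 10 "anomaly" points.
    rng = random.Random(1)
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    X += [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(10)]
    y = ["normal"] * 200 + ["anomaly"] * 10
    X_balanced, y_balanced = vq_undersample(X, y)
    print(Counter(y_balanced))
```

Under these assumptions, the majority class is represented by centroids rather than by a random subset, which is one way a quantizer could preserve the shape of the majority distribution while shrinking it. For a data stream, the same function could be applied per batch, although the abstract does not say whether the authors do so.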