Large datasets in machine learning often contain missing values, which necessitates imputation. This work is motivated by network traffic classification, where traditional data imputation methods do not perform well. We observe that no existing method directly accounts for classification accuracy during data imputation. We therefore propose a joint data imputation and classification method, termed generative adversarial classification network (GACN). Its architecture comprises a generator network, a discriminator network, and a classification network, which are iteratively optimized toward the ultimate objective of classification accuracy. For the scenario where some data samples are unlabeled, we further propose an extension termed semi-supervised GACN (SS-GACN), which can use partially labeled data to improve classification accuracy. Experiments with real-world network traffic traces demonstrate that GACN and SS-GACN more accurately impute the data features that matter most for classification, and that they outperform existing methods in classification accuracy.
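
As a rough illustration of how a generator, a discriminator, and a classifier could be coupled through their losses, the following minimal sketch computes one forward pass. It is not the formulation used in this work: the mask-predicting discriminator (in the style of GAIN-like imputation), the single-layer "networks", the weight `lam`, and all function names are assumptions made for illustration, and the gradient updates that would alternate between the three networks are omitted. Features are assumed to be scaled to [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(n_in, n_out):
    # Hypothetical single-layer stand-in for a network: (weights, bias).
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gacn_losses(x, mask, y, G, D, C, lam=1.0):
    """x: (n, d) features with missing entries zeroed; mask: 1 = observed;
    y: integer class labels. Returns the imputed data and three losses."""
    eps = 1e-8
    noise = rng.uniform(0.0, 0.01, x.shape)
    # Generator sees the observed values, noise in the gaps, and the mask.
    g_in = np.concatenate([mask * x + (1 - mask) * noise, mask], axis=1)
    x_gen = sigmoid(g_in @ G[0] + G[1])
    # Keep observed entries; fill only the missing ones.
    x_hat = mask * x + (1 - mask) * x_gen
    # Assumed discriminator: per-feature probability that an entry was observed.
    d_prob = sigmoid(x_hat @ D[0] + D[1])
    # Classifier operates on the imputed data.
    c_prob = softmax(x_hat @ C[0] + C[1])
    d_loss = -np.mean(mask * np.log(d_prob + eps)
                      + (1 - mask) * np.log(1 - d_prob + eps))
    c_loss = -np.mean(np.log(c_prob[np.arange(len(y)), y] + eps))
    # Generator objective: fool the discriminator on imputed entries AND
    # help the classifier (the coupling term, weighted by the assumed lam).
    g_adv = -np.mean((1 - mask) * np.log(d_prob + eps))
    g_loss = g_adv + lam * c_loss
    return x_hat, d_loss, c_loss, g_loss

# Toy usage on synthetic data (not network traffic traces).
n, d, k = 8, 5, 3
x_full = rng.uniform(0.0, 1.0, (n, d))
mask = (rng.uniform(0.0, 1.0, (n, d)) > 0.3).astype(float)
y = rng.integers(0, k, n)
G, D, C = linear(2 * d, d), linear(d, d), linear(d, k)
x_hat, d_loss, c_loss, g_loss = gacn_losses(x_full * mask, mask, y, G, D, C)
```

In this sketch the classification loss enters the generator objective directly, so imputation errors on features the classifier relies on are penalized more heavily than errors on features it ignores. In a semi-supervised variant, one would presumably compute `c_loss` only over the labeled samples; that detail is likewise an assumption here.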