The recent success and proliferation of machine learning and deep learning have produced powerful tools that are also used for encrypted traffic analysis, classification, and threat detection in computer networks. These methods, neural networks in particular, are often complex and require a very large corpus of training data. This paper therefore focuses on collecting a large, up-to-date dataset with almost 200 fine-grained service labels and 140 million network flows extended with packet-level metadata. The number of flows is three orders of magnitude higher than in other existing public labeled datasets of encrypted traffic. The number of service labels, which matters for making the problem hard and realistic, is four times higher than in the public dataset with the most class labels. The published dataset is intended as a benchmark for identifying services in encrypted traffic. Service identification can be further extended with the task of "rejecting" unknown services, i.e., traffic not seen during the training phase. Neural networks offer superior performance for tackling this more challenging problem. To showcase the dataset's usefulness, we implemented a neural network with a multi-modal architecture, which is the state-of-the-art approach, and achieved 97.04% classification accuracy and detected 91.94% of unknown services at a 5% false positive rate.
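To make the "detection rate at a 5% false positive rate" metric concrete, the following is a minimal sketch of how such an operating point can be computed. It is not the authors' method: the abstract does not say how unknown services are rejected, so the sketch assumes a common baseline in which a flow is rejected when the classifier's maximum softmax probability falls below a threshold. It also assumes that a false positive means a flow from a known service wrongly rejected as unknown. The function names `max_softmax_score`, `threshold_at_fpr`, and `detection_rate` are hypothetical.

```python
import numpy as np


def max_softmax_score(logits):
    """Confidence score per flow: the largest softmax probability (assumed baseline)."""
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    probs = np.exp(z)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.max(axis=1)


def threshold_at_fpr(known_scores, fpr=0.05):
    """Pick the threshold below which roughly `fpr` of known-service flows fall.

    Flows scoring below the threshold are rejected as unknown, so about `fpr`
    of the known flows are falsely rejected.
    """
    return float(np.quantile(known_scores, fpr))


def detection_rate(unknown_scores, threshold):
    """Fraction of truly unknown flows that are rejected at the given threshold."""
    return float(np.mean(unknown_scores < threshold))


if __name__ == "__main__":
    # Synthetic scores for illustration only; these are not results from the paper.
    rng = np.random.default_rng(0)
    known = rng.beta(8, 2, size=10_000)    # known services: mostly high confidence
    unknown = rng.beta(2, 5, size=10_000)  # unknown services: mostly low confidence
    thr = threshold_at_fpr(known, fpr=0.05)
    print("threshold:", thr)
    print("unknown detection rate:", detection_rate(unknown, thr))
```

The threshold is calibrated on known-service flows only, which mirrors the open-set setting described above: unknown services are by definition unavailable during training, so they can be used only to evaluate the detection rate.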