In contrast to previous surveys, the present work does not focus on reviewing the datasets used in the network security field. Many of the available public labeled datasets represent network behavior only for a particular time period. Given the rate of change in malicious behavior and the serious challenge of labeling and maintaining these datasets, they quickly become obsolete. Therefore, this work focuses on the analysis of current labeling methodologies applied to network-based data. In the field of network security, labeling a representative network traffic dataset is particularly challenging and costly, since very specialized knowledge is required to classify network traces. Consequently, most current traffic labeling methods are based on the automatic generation of synthetic network traces, which hides many of the aspects essential for correctly differentiating between normal and malicious behavior. Alternatively, a few other methods incorporate non-expert users in the labeling of real traffic with the help of visual and statistical tools. However, an in-depth analysis suggests that all current labeling methods suffer from fundamental drawbacks regarding the quality, volume, and speed of the resulting dataset. This lack of consistent methods for continuously generating a representative dataset with an accurate and validated methodology must be addressed by the network security research community. Moreover, a consistent labeling methodology is a fundamental condition for the acceptance of novel detection approaches based on statistical and machine learning techniques.