Data representation plays a critical role in the performance of novelty detection (or ``anomaly detection'') methods in machine learning. The data representation of network traffic often determines the effectiveness of these models as much as the model itself. The wide range of novel events that network operators need to detect (e.g., attacks, malware, new applications, changes in traffic demands) gives rise to a broad range of possible models and data representations. In each scenario, practitioners must spend significant effort extracting and engineering the features that are most predictive for that situation or application. While anomaly detection is well-studied in computer networking, much existing work develops specific models that presume a particular representation, often IPFIX/NetFlow. Yet other representations may result in higher model accuracy, and the rise of programmable networks now makes it more practical to explore a broader range of representations. To facilitate such exploration, we develop a systematic framework, open-source toolkit, and public Python library that make it both possible and easy to extract and generate features from network traffic and to perform an end-to-end evaluation of these representations across the most prevalent modern novelty detection models. First, we develop and publicly release an open-source tool, an accompanying Python library (NetML), and an end-to-end pipeline for novelty detection in network traffic. Second, we apply this tool to five different novelty detection problems in networking, across a range of scenarios from attack detection to novel device detection. Our findings provide general insights and guidelines concerning which features appear to be more appropriate for particular situations.