The development of machine learning algorithms in the cyber security domain has been impeded by the complex, hierarchical, sequential and multimodal nature of the data involved. In this paper we introduce the notion of a streaming tree as a generic data structure encompassing a large portion of real-world cyber security data. Starting from host-based event logs, we represent computer processes as streaming trees that evolve in continuous time. Leveraging the properties of the signature kernel, a machine learning tool that has recently emerged as a leading technology for learning with complex sequences of data, we develop the SK-Tree algorithm. SK-Tree is a supervised learning method for systematic malware detection on streaming trees that is robust to irregular sampling and high dimensionality of the underlying streams. We demonstrate the effectiveness of SK-Tree in detecting malicious events on a portion of the publicly available DARPA OpTC dataset, achieving an AUROC score of 98%.
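To make the notion of a streaming tree concrete, the following is a minimal illustrative sketch and not the construction used by SK-Tree. It assumes that each log record carries a timestamp, a process id, a parent process id and a numeric feature vector; the names `StreamingTree`, `observe` and `build_forest`, as well as the record layout, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple

Event = Tuple[float, int, Optional[int], Tuple[float, ...]]


@dataclass
class StreamingTree:
    """Hypothetical node: one process, its timestamped stream, and its child processes."""
    pid: int
    stream: List[Tuple[float, Tuple[float, ...]]] = field(default_factory=list)
    children: List["StreamingTree"] = field(default_factory=list)

    def observe(self, t: float, features: Iterable[float]) -> None:
        # Append one observation; timestamps may be irregularly spaced.
        self.stream.append((t, tuple(features)))

    def size(self) -> int:
        # Number of processes in the subtree rooted at this node.
        return 1 + sum(child.size() for child in self.children)


def build_forest(events: Iterable[Event]) -> List[StreamingTree]:
    """Replay (timestamp, pid, parent_pid, features) records in time order.

    Assumed record layout; a process whose parent has not been seen becomes a root.
    """
    nodes: Dict[int, StreamingTree] = {}
    roots: List[StreamingTree] = []
    for t, pid, ppid, features in sorted(events, key=lambda e: e[0]):
        node = nodes.get(pid)
        if node is None:
            # First time this process appears: create it and attach it to its parent.
            node = StreamingTree(pid)
            nodes[pid] = node
            parent = nodes.get(ppid)
            if parent is None:
                roots.append(node)
            else:
                parent.children.append(node)
        node.observe(t, features)
    return roots


# Toy records (made up for illustration), deliberately out of order.
events = [
    (0.0, 1, None, (1.0, 0.0)),
    (2.5, 3, 2, (0.5, 0.5)),
    (1.0, 2, 1, (0.0, 1.0)),
    (4.0, 1, None, (1.0, 1.0)),
]
roots = build_forest(events)
```

In this sketch the tree grows as new processes are spawned, and each node's stream grows as new events arrive, which is the sense in which the structure evolves in continuous time. A kernel on such objects would then compare both the streams and the parent-child structure; that step is not shown here.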