Research and development of techniques that detect or remediate malicious network activity requires access to diverse, realistic, contemporary data sets containing labeled malicious connections. Without such data, these techniques cannot be meaningfully trained, tested, or evaluated. Synthetically produced data containing fabricated or merged network traffic is of limited value because even simple machine learning (ML) algorithms can easily distinguish it from real traffic. Real network data is preferable and ubiquitous, but it is generally both sensitive and lacking in ground truth labels, which limits its utility for ML research. This paper presents a multi-faceted approach to generating a data set of labeled malicious connections embedded within anonymized network traffic collected from large production networks. Real-world malware is defanged and introduced to simulated, secured nodes within those networks to generate realistic traffic while maintaining sufficient isolation to protect real data and infrastructure. Network sensor data, including this embedded malware traffic, is collected at a network edge and anonymized for research use. Network traffic was collected and produced using these methods at two major educational institutions. The result is a highly realistic, long-term, multi-institution data set with embedded data labels spanning over 1.5 trillion connections and over a petabyte of sensor log data. The usability of this data set is demonstrated by its utility to our artificial intelligence and machine learning (AI/ML) research program.