A review of the ransomware detection research literature found that almost no proposal provided enough detail on how its test data set was created, or a sufficient description of its actual content, to allow other researchers to recreate the data set, reconstruct the test environment, and validate the research results. NapierOne, a modern cybersecurity mixed file data set, is presented; it is primarily aimed at, but not limited to, ransomware detection and forensic analysis research. NapierOne was designed to address this deficiency in reproducibility and to improve consistency by facilitating research replication and repeatability. The methodology used to create the data set is also described in detail. The data set was inspired by the Govdocs1 data set, and NapierOne is intended to be used as a complement to that original data set. An investigation was performed to determine the common file types currently in use. No research was found that explicitly provided this information, so an alternative consensus approach was employed, which combined the findings from multiple sources of file type usage into an overall ranked list. Then, for each of the common file types identified, 5000 real-world example files were gathered and a dedicated data subset was created. In some cases, multiple data subsets were created for a single file type, each representing a specific characteristic of that file type. For example, there are multiple data subsets for the ZIP file type, each containing examples of a specific compression method. Ransomware execution tends to produce files that have high entropy, so examples of file types that naturally have this attribute are also included.
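To make the entropy remark concrete, the following is a minimal sketch of how the Shannon entropy of a file's bytes is commonly computed. It is not taken from the NapierOne methodology: the function name `shannon_entropy` and the sample inputs are illustrative assumptions, and random bytes from `os.urandom` are used only as a stand-in for encrypted output.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical sample inputs, chosen only to illustrate the contrast.
plain_text = b"The quick brown fox jumps over the lazy dog. " * 200
random_like = os.urandom(65536)  # stand-in for encrypted content (assumption)

text_entropy = shannon_entropy(plain_text)
random_entropy = shannon_entropy(random_like)
print(f"plain text:  {text_entropy:.3f} bits/byte")
print(f"random data: {random_entropy:.3f} bits/byte")
```

Plain text typically scores well below the 8 bits per byte maximum, while encrypted or well-compressed content sits close to it. This is why file types that are naturally high in entropy are useful counterexamples when evaluating an entropy-based detector.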
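The consensus approach described in the abstract, combining several sources of file type usage into one ranked list, could be sketched as follows. The abstract does not state which aggregation rule was used, so the Borda-style count below is an assumption chosen for illustration. The function name `consensus_rank` and the three source lists are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict


def consensus_rank(rankings: list[list[str]]) -> list[str]:
    """Merge several ranked lists into one using a Borda-style count.

    An item at position p (0-based) in a list of length n earns n - p
    points; items missing from a list earn nothing from it. Ties are
    broken alphabetically so the result is deterministic.
    """
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical per-source rankings (placeholders, not the study's data).
sources = [
    ["pdf", "docx", "jpg", "zip"],
    ["jpg", "pdf", "png", "docx"],
    ["pdf", "jpg", "docx", "xlsx"],
]

overall = consensus_rank(sources)
print(overall)
```

A positional scheme like this rewards file types that rank highly across many sources and tolerates sources that cover different sets of types. Other aggregation rules, such as using the median rank, would be equally plausible readings of "consensus".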