Several cybersecurity domains, such as ransomware detection, forensics, and data analysis, require methods to reliably identify encrypted data fragments. Current approaches typically rely on statistics derived from the byte-level distribution, such as entropy estimates, to identify encrypted fragments. However, modern content types use compression techniques that alter the data distribution, pushing it closer to uniform. As a result, current approaches detect encryption unreliably when compressed data appears in the dataset. Furthermore, proposed approaches are typically evaluated over only a few data types and fragment sizes, making it hard to assess their practical applicability. This paper compares existing statistical tests on a large, standardized dataset and shows that current approaches consistently fail to distinguish encrypted from compressed data at both small and large fragment sizes. We address these shortcomings and design EnCoD, a learning-based classifier that can reliably distinguish compressed and encrypted data. We evaluate EnCoD on a dataset of 16 different file types and fragment sizes ranging from 512B to 8KB. Our results show that EnCoD outperforms current approaches by a wide margin, with accuracy ranging from ~82% for 512B fragments up to ~92% for 8KB fragments. Moreover, EnCoD can pinpoint the exact format of a given data fragment, rather than performing only binary classification as previous approaches do.
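To illustrate why byte-level entropy struggles to separate the two classes, the following minimal sketch computes the Shannon entropy of a fragment's byte histogram. It is not the paper's evaluation code: the helper name `byte_entropy`, the use of seeded pseudo-random bytes as a stand-in for ciphertext, and the use of zlib (DEFLATE) on synthetic text as the compressed sample are all assumptions made for illustration.

```python
import math
import random
import zlib
from collections import Counter


def byte_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte-value histogram, in bits per byte (0 to 8)."""
    if not fragment:
        return 0.0
    n = len(fragment)
    counts = Counter(fragment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


FRAGMENT_SIZE = 8192  # 8KB, the largest fragment size mentioned in the abstract
rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Assumption: uniformly random bytes stand in for ciphertext; no real cipher is used.
pseudo_encrypted = bytes(rng.getrandbits(8) for _ in range(FRAGMENT_SIZE))

# Assumption: synthetic "text" from a small vocabulary stands in for real content.
vocabulary = [b"alpha", b"beta", b"gamma", b"delta",
              b"epsilon", b"zeta", b"eta", b"theta"]
text = b" ".join(rng.choice(vocabulary) for _ in range(60000))

# Compress, then take a fragment from the middle of the stream to skip the header.
stream = zlib.compress(text, 9)
compressed = stream[1024:1024 + FRAGMENT_SIZE]

print(f"plain text       : {byte_entropy(text[:FRAGMENT_SIZE]):.3f} bits/byte")
print(f"compressed       : {byte_entropy(compressed):.3f} bits/byte")
print(f"pseudo-encrypted : {byte_entropy(pseudo_encrypted):.3f} bits/byte")
```

In a run of this sketch, the uncompressed text should score far below the other two samples, while the compressed and pseudo-encrypted fragments should both land near the 8 bits/byte ceiling. A single entropy threshold therefore has little room to tell them apart, which is the failure mode the abstract describes.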