Modern encryption algorithms form the foundation of digital security. However, their widespread use poses a significant challenge for network defenders: identifying which specific algorithm is being employed. More importantly, we find that when the plaintext distribution of the test data departs from that of the training data, classifier performance often declines significantly. This issue exposes the feature extractor's hidden dependency on plaintext features. To reduce this dependency, we adopt a method that does not learn end-to-end from ciphertext bytes. Instead, it applies a set of statistical tests to compute randomness features of the ciphertext and then uses the frequency distribution of these features to construct a fingerprint for each algorithm. Experimental results demonstrate that our method achieves high discriminative performance (e.g., AUC > 0.98) on the Canterbury Corpus dataset, which contains a diverse set of data types. Furthermore, in our cross-domain evaluation, the performance of baseline models degrades significantly when tested on data with a reduced proportion of structured plaintext. In sharp contrast, our method is highly robust: performance degradation is minimal when transferring between different structured domains, and even on the most challenging purely random dataset it maintains a high level of ranking ability (AUC > 0.90).
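The fingerprinting pipeline described above can be illustrated with a minimal sketch. The abstract does not specify which statistical tests are used, so this example assumes the NIST-style monobit frequency test as a representative randomness test; the function names (`monobit_pvalue`, `fingerprint`) and the block size and bin count are illustrative choices, not the paper's actual parameters.

```python
import math

def monobit_pvalue(block: bytes) -> float:
    """Monobit frequency test (assumed example test): p-value for the
    hypothesis that the bits of `block` are drawn uniformly at random."""
    n = len(block) * 8
    ones = sum(bin(b).count("1") for b in block)
    s = abs(2 * ones - n) / math.sqrt(n)
    return math.erfc(s / math.sqrt(2))

def fingerprint(ciphertext: bytes, block_size: int = 256, bins: int = 10):
    """Split the ciphertext into fixed-size blocks, score each block with the
    randomness test, and histogram the p-values into a normalized
    frequency-distribution vector, i.e. the algorithm's 'fingerprint'."""
    blocks = [ciphertext[i:i + block_size]
              for i in range(0, len(ciphertext) - block_size + 1, block_size)]
    counts = [0] * bins
    for b in blocks:
        p = monobit_pvalue(b)
        counts[min(int(p * bins), bins - 1)] += 1
    total = sum(counts) or 1  # avoid division by zero on empty input
    return [c / total for c in counts]

# Fingerprints from different algorithms (or the same algorithm on held-out
# data) could then be compared with any distribution distance, e.g. L1.
fp = fingerprint(bytes(range(256)) * 16)
```

Because the fingerprint is computed from randomness statistics rather than learned from raw ciphertext bytes, it cannot memorize plaintext-specific byte patterns, which is the source of the robustness claimed in the cross-domain evaluation.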