Zipf-Gramming：将字节N元语法扩展至生产级恶意软件语料库 (Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora)

A classifier using byte n-grams as features is the only approach we have found fast enough to meet requirements in size (sub 2 MB), speed (multiple GB/s), and latency (sub 10 ms) for deployment in numerous malware detection scenarios. However, we've consistently found that 6-8 grams achieve the best accuracy on our production deployments but have been unable to deploy regularly updated models due to the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the Zipfian distribution well models the distribution of n-grams, we exploit its properties to develop a new top-k n-gram extractor that is up to $35\times$ faster than the previous best alternative. Using our new Zipf-Gramming algorithm, we are able to scale up our production training set and obtain up to 30\% improvement in AUC at detecting new malware. We show theoretically and empirically that our approach will select the top-k items with little error and the interplay between theory and engineering required to achieve these results.

翻译：使用字节N元语法作为特征的分类器是我们发现的唯一能够满足多种恶意软件检测场景部署要求的方法，这些要求包括模型大小（小于2 MB）、处理速度（每秒数GB）和延迟（低于10毫秒）。然而，我们始终发现，6至8元语法在生产部署中实现了最佳准确率，但由于在数TB的可执行程序中查找前k个最频繁N元语法的成本过高，我们一直无法部署定期更新的模型。由于Zipf分布能够很好地模拟N元语法的分布，我们利用其特性开发了一种新的前k个N元语法提取器，其速度比之前最佳替代方案快高达$35\times$。通过使用我们新的Zipf-Gramming算法，我们能够扩展生产训练集，并在检测新恶意软件时获得高达30%的AUC提升。我们从理论和实证两方面证明，我们的方法能以极小的误差选择前k个项目，并阐述了实现这些结果所需的理论与工程之间的相互作用。