Fuzzy hashes are an important tool in digital forensics and are used in approximate matching to determine the similarity between digital artifacts. They translate the byte code of files into computable strings, which makes them particularly interesting for intelligent machine processing. In this work, we propose deep learning approximate matching (DLAM), which achieves much higher accuracy in detecting anomalies in fuzzy hashes than conventional approaches. Beyond the well-known application of clustering malware, we show that fuzzy hashes and deep learning are indeed well-suited to classifying files according to the presence of certain content, e.g., malware. DLAM relies on transformer-based models from the field of natural language processing and outperforms existing methods. Traditional fuzzy hashes like TLSH and ssdeep have a limited size and fail to detect file anomalies that are small relative to the overall file size. DLAM, however, enables the detection of such file correlations in the fuzzy hashes computed by TLSH and ssdeep, even for anomaly sizes of less than 15%. It achieves results comparable to state-of-the-art fuzzy hashing algorithms while relying on more efficient hash computations and can, therefore, be used at a much larger scale.