Recent advances in generative models have led to their application in password guessing, with the aim of replicating the complexity, structure, and patterns of human-created passwords. Despite their potential, inconsistencies and inadequate evaluation methodologies in prior research have hindered meaningful comparisons and a comprehensive, unbiased understanding of their capabilities. This paper introduces MAYA, a unified, customizable, plug-and-play benchmarking framework designed to facilitate the systematic characterization and benchmarking of generative password-guessing models in the context of trawling attacks. Using MAYA, we conduct a comprehensive assessment of six state-of-the-art approaches, which we re-implemented and adapted to ensure standardization. Our evaluation spans eight real-world password datasets and covers an exhaustive set of advanced testing scenarios, totaling over 15,000 compute hours. Our findings indicate that these models effectively capture different aspects of human password distribution and exhibit strong generalization capabilities. However, their effectiveness varies significantly with long and complex passwords. In our evaluation, sequential models consistently outperform other generative architectures and traditional password-guessing tools, demonstrating unique capabilities in generating accurate and complex guesses. Moreover, the diverse password distributions learned by the models enable a multi-model attack that outperforms the best individual model. By releasing MAYA, we aim to foster further research, providing the community with a new tool to consistently and reliably benchmark generative password-guessing models. Our framework is publicly available at https://github.com/williamcorrias/MAYA-Password-Benchmarking.