When Intelligence Fails: An Empirical Study on Why LLMs Struggle with Password Cracking

The remarkable capabilities of Large Language Models (LLMs) in natural language understanding and generation have sparked interest in their potential for cybersecurity applications, including password guessing. In this study, we conduct an empirical investigation into the efficacy of pre-trained LLMs for password cracking using synthetic user profiles. Specifically, we evaluate the performance of state-of-the-art open-source LLMs such as TinyLLaMA, Falcon-RW-1B, and Flan-T5 by prompting them to generate plausible passwords based on structured user attributes (e.g., name, birthdate, hobbies). Our results, measured using Hit@1, Hit@5, and Hit@10 metrics under both plaintext and SHA-256 hash comparisons, reveal consistently poor performance, with all models achieving less than 1.5% accuracy at Hit@10. In contrast, traditional rule-based and combinator-based cracking methods demonstrate significantly higher success rates. Through detailed analysis and visualization, we identify key limitations in the generative reasoning of LLMs when applied to the domain-specific task of password guessing. Our findings suggest that, despite their linguistic prowess, current LLMs lack the domain adaptation and memorization capabilities required for effective password inference, especially in the absence of supervised fine-tuning on leaked password datasets. This study provides critical insights into the limitations of LLMs in adversarial contexts and lays the groundwork for future efforts in secure, privacy-preserving, and robust password modeling.
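The Hit@k metric mentioned above can be illustrated with a minimal sketch. It assumes that Hit@k is the fraction of profiles whose true password appears among a model's first k guesses, and that the hash comparison matches SHA-256 digests of UTF-8 encoded strings; the abstract does not spell out either detail. The names `sha256_hex` and `hit_at_k`, and the sample data, are hypothetical and not taken from the study.

```python
import hashlib
from typing import List, Sequence


def sha256_hex(password: str) -> str:
    """Return the SHA-256 hex digest of a password (UTF-8 encoding assumed)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def hit_at_k(
    guesses: Sequence[List[str]],
    targets: Sequence[str],
    k: int,
    hashed: bool = False,
) -> float:
    """Fraction of profiles whose true password is among the first k guesses.

    guesses[i] is the ranked guess list for profile i; targets[i] is its
    true plaintext password. With hashed=True the comparison is done on
    SHA-256 digests instead of on the plaintext strings.
    """
    if len(guesses) != len(targets):
        raise ValueError("guesses and targets must have the same length")
    if not targets:
        return 0.0

    hits = 0
    for ranked, target in zip(guesses, targets):
        top_k = ranked[:k]
        if hashed:
            hit = sha256_hex(target) in {sha256_hex(g) for g in top_k}
        else:
            hit = target in top_k
        hits += int(hit)
    return hits / len(targets)


# Hypothetical guesses for two synthetic profiles (illustrative data only).
example_guesses = [
    ["alice1990", "alice123", "a1990!"],
    ["bob!", "bobby", "bob2000"],
]
example_targets = ["alice123", "bob1985"]

hit1 = hit_at_k(example_guesses, example_targets, k=1)
hit5 = hit_at_k(example_guesses, example_targets, k=5)
hit5_hashed = hit_at_k(example_guesses, example_targets, k=5, hashed=True)
```

Under these assumptions the plaintext and hashed variants agree for any given guess list, since SHA-256 is deterministic; the hashed form simply reflects the setting where only digests of the target passwords are available.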
