As large language models (LLMs) are increasingly deployed in high-stakes domains, ensuring their security and alignment has become a critical challenge. Existing red-teaming practices depend heavily on manual testing, which limits scalability and fails to comprehensively cover the vast space of potential adversarial behaviors. This paper introduces an automated red-teaming framework that systematically generates, executes, and evaluates adversarial prompts to uncover security vulnerabilities in LLMs. Our framework integrates meta-prompting-based attack synthesis, multi-modal vulnerability detection, and standardized evaluation protocols spanning six major threat categories -- reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Experiments on the GPT-OSS-20B model reveal 47 distinct vulnerabilities, including 21 high-severity findings and 12 novel attack patterns, achieving a $3.9\times$ improvement in vulnerability discovery rate over manual expert testing while maintaining 89\% detection accuracy. These results demonstrate the framework's effectiveness in enabling scalable, systematic, and reproducible AI safety evaluations. By providing actionable insights for improving alignment robustness, this work advances the state of automated LLM red-teaming and contributes to the broader goal of building secure and trustworthy AI systems.