Current Large Language Model (LLM) safety approaches focus on explicitly harmful content while overlooking a critical vulnerability: the inability to understand context and recognize user intent. This gap creates exploitable weaknesses that malicious users can systematically leverage to circumvent safety mechanisms. We empirically evaluate multiple state-of-the-art LLMs, including ChatGPT, Claude, Gemini, and DeepSeek. Our analysis demonstrates that ostensibly reliable safety mechanisms can be circumvented through emotional framing, progressive revelation, and academic justification techniques. Notably, reasoning-enabled configurations amplified rather than mitigated the effectiveness of these exploits, increasing factual precision while failing to interrogate the underlying intent. The exception was Claude Opus 4.1, which in some cases prioritized intent detection over information provision. This pattern reveals that current architectural designs create systematic vulnerabilities. Addressing these limitations requires a paradigmatic shift toward treating contextual understanding and intent recognition as core safety capabilities rather than post-hoc protective mechanisms.