Current Large Language Model (LLM) safety approaches focus on explicitly harmful content while overlooking a critical vulnerability: the inability to understand context and recognize user intent. Malicious users can systematically exploit this gap to circumvent safety mechanisms. We empirically evaluate multiple state-of-the-art LLMs, including ChatGPT, Claude, Gemini, and DeepSeek. Our analysis demonstrates that otherwise reliable safety mechanisms can be circumvented through emotional framing, progressive revelation, and academic justification techniques. Notably, reasoning-enabled configurations amplified rather than mitigated the effectiveness of these exploits, increasing factual precision while failing to interrogate the underlying intent. The exception was Claude Opus 4.1, which in some cases prioritized intent detection over information provision. This pattern reveals that current architectural designs create systematic vulnerabilities. Addressing these limitations requires a paradigmatic shift toward contextual understanding and intent recognition as core safety capabilities rather than post-hoc protective mechanisms.