基于指令遵循意图分析的间接提示注入攻击缓解方法 (Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis)

Indirect prompt injection attacks (IPIAs), where large language models (LLMs) follow malicious instructions hidden in input data, pose a critical threat to LLM-powered agents. In this paper, we present IntentGuard, a general defense framework based on instruction-following intent analysis. The key insight of IntentGuard is that the decisive factor in IPIAs is not the presence of malicious text, but whether the LLM intends to follow instructions from untrusted data. Building on this insight, IntentGuard leverages an instruction-following intent analyzer (IIA) to identify which parts of the input prompt the model recognizes as actionable instructions, and then flag or neutralize any overlaps with untrusted data segments. To instantiate the framework, we develop an IIA that uses three "thinking intervention" strategies to elicit a structured list of intended instructions from reasoning-enabled LLMs. These techniques include start-of-thinking prefilling, end-of-thinking refinement, and adversarial in-context demonstration. We evaluate IntentGuard on two agentic benchmarks (AgentDojo and Mind2Web) using two reasoning-enabled LLMs (Qwen-3-32B and gpt-oss-20B). Results demonstrate that IntentGuard achieves (1) no utility degradation in all but one setting and (2) strong robustness against adaptive prompt injection attacks (e.g., reducing attack success rates from 100% to 8.5% in a Mind2Web scenario).

翻译：间接提示注入攻击（IPIAs）是指大型语言模型（LLMs）遵循输入数据中隐藏的恶意指令，这对基于LLM的智能体构成了严重威胁。本文提出IntentGuard，一种基于指令遵循意图分析的通用防御框架。IntentGuard的核心洞见在于：IPIAs的决定性因素并非恶意文本的存在，而是LLM是否意图遵循来自不可信数据的指令。基于这一洞见，IntentGuard利用指令遵循意图分析器（IIA）识别模型将输入提示的哪些部分视为可执行指令，并标记或消除与不可信数据段的重叠部分。为实例化该框架，我们开发了一种IIA，采用三种“思维干预”策略从具备推理能力的LLMs中提取结构化的预期指令列表。这些技术包括思维起始预填充、思维终止精炼和对抗性上下文演示。我们在两个智能体基准测试（AgentDojo和Mind2Web）上使用两种具备推理能力的LLMs（Qwen-3-32B和gpt-oss-20B）评估IntentGuard。结果表明，IntentGuard实现了（1）除一种设置外均无性能下降，以及（2）对自适应提示注入攻击的强大鲁棒性（例如在Mind2Web场景中将攻击成功率从100%降至8.5%）。