LLM-integrated applications and agents are vulnerable to prompt injection attacks, in which an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection methods have limited effectiveness against state-of-the-art attacks, let alone adaptive ones. In this work, we propose DataSentinel, a game-theoretic method to detect prompt injection attacks. Specifically, DataSentinel fine-tunes an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection. We formulate this as a minimax optimization problem: the outer minimization fine-tunes the LLM to detect contaminated inputs, while the inner maximization adapts the injected prompts to evade detection. Furthermore, we propose a gradient-based method that solves this minimax optimization problem by alternating between the inner max and outer min problems. Our evaluation on multiple benchmark datasets and LLMs shows that DataSentinel effectively detects both existing and adaptive prompt injection attacks.
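The abstract leaves the objective implicit; one natural way to write the game is sketched below, assuming (notation introduced here, not taken from the paper) that $f_{\theta}$ is the detection LLM with parameters $\theta$, $x$ a clean input drawn from a distribution $\mathcal{D}$, $z$ an injected prompt, $x \oplus z$ the contaminated input, and $\ell$ a detection loss:

\[
\min_{\theta} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \ell\big(f_{\theta}(x), \text{clean}\big) \;+\; \max_{z} \, \ell\big(f_{\theta}(x \oplus z), \text{contaminated}\big) \Big]
\]

The inner max plays the adaptive attacker, choosing $z$ so the contaminated input evades the current detector; the outer min fine-tunes $\theta$ against those worst-case injections.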
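To make the alternating solver concrete, here is a minimal PyTorch sketch under strong simplifying assumptions: the detector is a toy classifier over continuous embeddings rather than a fine-tuned LLM, and the injected prompt is relaxed to a continuous embedding optimized by gradient ascent. All names here (Detector, inner_max, the dimensions and step counts) are illustrative, not the paper's implementation.

```python
# Hedged sketch of the alternating minimax loop described in the abstract.
# Assumptions: continuous prompt relaxation, toy detector; not the paper's code.
import torch
import torch.nn as nn

torch.manual_seed(0)
EMB_DIM, PROMPT_LEN, BATCH = 32, 4, 16

class Detector(nn.Module):
    """Toy stand-in for the detection LLM: logit > 0 means 'contaminated'."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(EMB_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):                      # x: (batch, seq, emb)
        return self.net(x.mean(dim=1)).squeeze(-1)

detector = Detector()
opt_outer = torch.optim.Adam(detector.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def inner_max(clean, steps=10, lr=0.1):
    """Inner max: adapt the injected-prompt embedding z so contaminated
    inputs evade the current detector (i.e., maximize the detector's loss)."""
    z = torch.zeros(clean.size(0), PROMPT_LEN, EMB_DIM, requires_grad=True)
    opt_inner = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        contaminated = torch.cat([clean, z], dim=1)      # append injection
        # The attacker wants the contaminated input labeled 'clean' (0).
        evasion_loss = bce(detector(contaminated), torch.zeros(clean.size(0)))
        opt_inner.zero_grad()
        evasion_loss.backward()
        opt_inner.step()
    return z.detach()

for step in range(200):
    clean = torch.randn(BATCH, 8, EMB_DIM)               # surrogate clean inputs
    z = inner_max(clean)                                 # solve inner max
    contaminated = torch.cat([clean, z], dim=1)
    # Outer min: detector should pass clean (0) and flag contaminated (1).
    logits = torch.cat([detector(clean), detector(contaminated)])
    labels = torch.cat([torch.zeros(BATCH), torch.ones(BATCH)])
    loss = bce(logits, labels)
    opt_outer.zero_grad()
    loss.backward()
    opt_outer.step()
```

A real instantiation would face a discrete inner problem, since injected prompts are token sequences; the continuous relaxation above only illustrates the alternation between the attacker's max step and the detector's min step.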