形式化与基准化提示注入攻击与防御方法 (Formalizing and Benchmarking Prompt Injection Attacks and Defenses)

from arxiv, Published in USENIX Security Symposium 2024; the model sizes for closed-source models are from blog posts. For slides, see https://people.duke.edu/~zg70/code/PromptInjection.pdf

A prompt injection attack aims to inject malicious instruction/data into the input of an LLM-Integrated Application such that it produces results as an attacker desires. Existing works are limited to case studies. As a result, the literature lacks a systematic understanding of prompt injection attacks and their defenses. We aim to bridge the gap in this work. In particular, we propose a framework to formalize prompt injection attacks. Existing attacks are special cases in our framework. Moreover, based on our framework, we design a new attack by combining existing ones. Using our framework, we conduct a systematic evaluation on 5 prompt injection attacks and 10 defenses with 10 LLMs and 7 tasks. Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses. To facilitate research on this topic, we make our platform public at https://github.com/liu00222/Open-Prompt-Injection.

翻译：提示注入攻击旨在向大型语言模型集成应用的输入中注入恶意指令/数据，使其按照攻击者意图生成结果。现有研究仅限于案例探讨，导致学术界对提示注入攻击及其防御机制缺乏系统性认知。本研究致力于填补这一空白。具体而言，我们提出了形式化提示注入攻击的框架体系，现有攻击方法均可视为本框架的特例。基于该框架，我们通过融合现有攻击手段设计出新型攻击方法。依托本框架，我们系统评估了5种提示注入攻击与10种防御策略在10个大型语言模型及7类任务中的表现。本研究为未来提示注入攻击与防御的量化评估建立了通用基准平台。为促进该领域研究，我们已将实验平台开源发布于 https://github.com/liu00222/Open-Prompt-Injection。