Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. However, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. By proposing PromptInject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how GPT-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. In particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit GPT-3's stochastic nature, creating long-tail risks. The code for PromptInject is available at https://github.com/agencyenterprise/PromptInject.
翻译:以变换器为基础的大型语言模型(LLMs)为大规模客户化应用中的自然语言任务提供了强有力的基础。然而,探索恶意用户互动所产生的脆弱性的研究却很少。我们通过提出PaintInject(基于面具的迭代对立即时组合的标语统一框架)来研究GPT-3(GPT-3,在生产过程中最广泛使用的语文模型)如何很容易被简单手工艺的投入误差。特别是,我们调查了两种类型的袭击 -- -- 目标劫持和迅速泄漏 -- -- 并表明即使低性、但足够恶意的代理商也能很容易地利用GPT-3的随机性,从而产生长尾风险。《快速输入代码》可在https://github.com/organitical Institive/PromptInject查阅。