Large Language Models (LLMs) have rapidly become integral to real-world applications, powering services across diverse sectors. However, their widespread deployment has exposed critical security risks, particularly through jailbreak prompts that bypass model alignment and induce harmful outputs. Despite intense research into both attack and defense techniques, the field remains fragmented: definitions, threat models, and evaluation criteria vary widely, impeding systematic progress and fair comparison. In this Systematization of Knowledge (SoK), we address these challenges by (1) proposing a holistic, multi-level taxonomy that organizes attacks, defenses, and vulnerabilities in LLM prompt security; (2) formalizing threat models and cost assumptions into machine-readable profiles for reproducible evaluation; (3) introducing an open-source evaluation toolkit for standardized, auditable comparison of attacks and defenses; (4) releasing JAILBREAKDB, the largest annotated dataset of jailbreak and benign prompts to date;\footnote{The dataset is released at \href{https://huggingface.co/datasets/youbin2014/JailbreakDB}{\textcolor{purple}{https://huggingface.co/datasets/youbin2014/JailbreakDB}}.} and (5) presenting a comprehensive evaluation platform and leaderboard of state-of-the-art methods.\footnote{To be released soon.} Our work unifies this fragmented landscape, provides rigorous foundations for future studies, and supports the development of robust, trustworthy LLMs suitable for high-stakes deployment.