SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models

Large Language Models (LLMs) have rapidly become integral to real-world applications, powering services across diverse sectors. However, their widespread deployment has exposed critical security risks, particularly through jailbreak prompts that can bypass model alignment and induce harmful outputs. Despite intense research into both attack and defense techniques, the field remains fragmented: definitions, threat models, and evaluation criteria vary widely, impeding systematic progress and fair comparison. In this Systematization of Knowledge (SoK), we address these challenges by (1) proposing a holistic, multi-level taxonomy that organizes attacks, defenses, and vulnerabilities in LLM prompt security; (2) formalizing threat models and cost assumptions into machine-readable profiles for reproducible evaluation; (3) introducing an open-source evaluation toolkit for standardized, auditable comparison of attacks and defenses; (4) releasing JAILBREAKDB, the largest annotated dataset of jailbreak and benign prompts to date (available at https://huggingface.co/datasets/youbin2014/JailbreakDB); and (5) presenting a comprehensive evaluation platform and leaderboard of state-of-the-art methods (to be released soon). Our work unifies fragmented research, provides rigorous foundations for future studies, and supports the development of robust, trustworthy LLMs suitable for high-stakes deployment.

