Large Language Models (LLMs) are increasingly popular, powering a wide range of applications. Their widespread use has raised concerns, particularly about jailbreak attacks that bypass safety measures to elicit harmful content. In this paper, we present a comprehensive security analysis of LLMs, addressing critical research questions on the evolution and determinants of model safety. Specifically, we begin by identifying the most effective techniques for detecting jailbreak attacks. Next, we investigate whether newer versions of LLMs offer improved security over their predecessors. We also assess the impact of model size on overall security and explore the potential benefits of integrating multiple defense strategies. Our study evaluates both open-source models (e.g., LLaMA and Mistral) and closed-source models (e.g., GPT-4) by employing four state-of-the-art attack techniques and assessing the efficacy of three new defense approaches.
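To make the idea of integrating multiple defense strategies concrete, the sketch below (ours, not taken from the paper) layers two published defense ideas: a perplexity-based input filter, which flags the unusually high-perplexity prompts that gradient-based jailbreak suffixes tend to produce, and a "self-reminder" prompt wrapper that prepends safety instructions. The gpt2 reference model, the threshold of 500, and all function names are illustrative assumptions, not the defenses evaluated in the study.

```python
# A minimal sketch (not the paper's implementation) of layering two defenses:
# a perplexity-based input filter and a "self-reminder" prompt wrapper.
# The reference model, threshold, and function names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model.eval()

def prompt_perplexity(prompt: str) -> float:
    """Score the prompt with a small reference LM; adversarial suffixes from
    gradient-based jailbreaks tend to have unusually high perplexity."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing input_ids as labels gives the mean token negative log-likelihood.
        loss = ref_model(ids, labels=ids).loss
    return torch.exp(loss).item()

def self_reminder(prompt: str) -> str:
    """Wrap the user prompt in safety instructions (a prompt-level defense)."""
    return ("You are a responsible assistant and must not generate harmful "
            f"or misleading content.\n{prompt}\nRemember to answer responsibly.")

def guarded_generate(prompt: str, generate, ppl_threshold: float = 500.0) -> str:
    """Apply both defenses before the target LLM sees the prompt.

    `generate` is a placeholder callable wrapping the model under test.
    """
    if prompt_perplexity(prompt) > ppl_threshold:
        return "Request blocked: input failed the perplexity filter."
    return generate(self_reminder(prompt))
```

The threshold here is arbitrary; in practice it would be calibrated on a corpus of benign prompts so that the filter rejects adversarial suffixes without blocking legitimate queries. Stacking an input-side filter with a prompt-level defense follows the defense-in-depth rationale the abstract alludes to: each layer covers failure modes the other misses.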