The rapid development of large reasoning models (LRMs), such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models~(LLMs). However, their enhanced capabilities, combined with the open-source availability of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging established safety benchmarks to evaluate their compliance with safety regulations. Furthermore, we investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications. Through our multi-faceted analysis, we uncover four key findings: (1) There is a significant safety gap between the open-source reasoning models and the o3-mini model, on both safety benchmarks and adversarial attacks, suggesting that more safety effort is needed for open-source LRMs. (2) The stronger the model's reasoning ability, the greater the potential harm it may cause when answering unsafe questions. (3) Safety thinking emerges in the reasoning process of LRMs, but it frequently fails against adversarial attacks. (4) The thinking process of R1 models poses greater safety concerns than their final answers. Our study provides insights into the security implications of reasoning models and highlights the need for further advancements in R1 models' safety to close the gap.