Security code review is a time-consuming and labor-intensive process that typically requires integration with automated security defect detection tools. However, existing security analysis tools suffer from poor generalization, high false positive rates, and coarse detection granularity. Large Language Models (LLMs) have been considered promising candidates for addressing these challenges. We conducted an empirical study to explore the potential of LLMs in detecting security defects during code review. Specifically, we evaluated the performance of seven LLMs under five different prompts and compared them with state-of-the-art static analysis tools. We also performed linguistic and regression analyses on the two top-performing LLMs to identify quality problems in their responses and factors influencing their performance. Our findings show that: (1) In security code review, LLMs significantly outperform state-of-the-art static analysis tools, and the reasoning-optimized LLM performs better than general-purpose LLMs. (2) DeepSeek-R1 achieves the highest performance, followed by GPT-4. The optimal prompt for DeepSeek-R1 incorporates both the commit message and chain-of-thought (CoT) guidance, while for GPT-4, the prompt with a Common Weakness Enumeration (CWE) list works best. (3) GPT-4 frequently produces vague expressions and has difficulty accurately following the instructions in the prompts, while DeepSeek-R1 more commonly generates inaccurate code details in its outputs. (4) LLMs are more adept at identifying security defects in code files that have fewer tokens and contain security-relevant annotations.