Evaluating Large Language Models for Line-Level Vulnerability Localization

Automated Vulnerability Localization (AVL), which aims to facilitate diagnosis by pinpointing the specific lines of code responsible for a vulnerability, has recently attracted growing attention. Large Language Models (LLMs) have shown potential in various domains, yet their effectiveness in line-level vulnerability localization remains underexplored. In this work, we present the first comprehensive empirical evaluation of LLMs for AVL. Our study examines 19 leading LLMs suitable for code analysis, including ChatGPT and multiple open-source models, spanning encoder-only, encoder-decoder, and decoder-only architectures, with model sizes from 60M to 70B parameters. We evaluate three paradigms: few-shot prompting, discriminative fine-tuning, and generative fine-tuning with and without Low-Rank Adaptation (LoRA). Experiments cover both a BigVul-derived dataset for C/C++ and a smart contract vulnerability dataset. Our results show that discriminative fine-tuning achieves substantial performance gains over existing learning-based AVL methods when sufficient training data is available. In low-data settings, prompting advanced LLMs such as ChatGPT proves more effective. We also identify challenges related to input length and unidirectional context during fine-tuning, and propose two remedial strategies: a sliding window approach and right-forward embedding, both of which yield significant improvements. Moreover, we provide the first assessment of LLM generalizability in AVL, showing that certain models can transfer effectively across Common Weakness Enumerations (CWEs) and projects. However, performance degrades notably for newly discovered vulnerabilities containing unfamiliar lexical or structural patterns, underscoring the need for continual adaptation.
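To make the sliding window idea concrete, the following is a minimal sketch and not the implementation used in this study. It assumes a function's tokens are split into fixed-size overlapping windows, that each window is scored independently by some model, and that a line's score is the maximum score of any of its tokens across all windows. The names `sliding_windows`, `localize`, and `score_window`, the default window and stride sizes, the max-aggregation rule, and the assumption that scores lie in [0, 1] are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence


def sliding_windows(n_tokens: int, window: int, stride: int) -> List[range]:
    """Return overlapping index ranges that together cover [0, n_tokens)."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    if n_tokens <= window:
        return [range(0, n_tokens)]
    starts = list(range(0, n_tokens - window, stride))
    starts.append(n_tokens - window)  # last window sits flush with the end
    return [range(s, s + window) for s in starts]


def localize(
    token_line_ids: Sequence[int],
    score_window: Callable[[range], Sequence[float]],
    window: int = 512,
    stride: int = 256,
) -> List[float]:
    """Aggregate per-token window scores into one score per source line.

    token_line_ids[i] is the 0-based line number of token i.
    score_window is a stand-in for a model call (an assumption): given the
    token positions of one window, it returns one score in [0, 1] per token.
    """
    n_lines = max(token_line_ids) + 1 if token_line_ids else 0
    line_scores = [0.0] * n_lines
    for idx in sliding_windows(len(token_line_ids), window, stride):
        scores = score_window(idx)
        for pos, score in zip(idx, scores):
            line = token_line_ids[pos]
            # A token seen in several windows keeps its highest score.
            line_scores[line] = max(line_scores[line], score)
    return line_scores


if __name__ == "__main__":
    # Toy scorer: flags only the token at position 7 (which lies on line 3).
    toy = lambda idx: [1.0 if p == 7 else 0.0 for p in idx]
    lines = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
    print(localize(lines, toy, window=4, stride=2))
```

Overlapping windows give each token some context from both neighbours even when the function exceeds the model's input limit. Other aggregation rules, such as averaging over windows, would be equally plausible under these assumptions.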
