VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

The widespread reliance on open-source software dramatically increases the risk of vulnerability exploitation, underscoring the need for effective and scalable vulnerability detection (VD). Existing VD techniques, whether based on traditional machine learning or on LLM approaches such as prompt engineering, supervised fine-tuning, or off-policy preference optimization, remain fundamentally limited in their ability to perform context-aware analysis: they depend on fixed inputs or static preference datasets, cannot adaptively explore repository-level dependencies, and are constrained by function-level benchmarks that overlook critical vulnerability context. This paper introduces Vulnerability-Adaptive Policy Optimization (VULPO), an on-policy LLM reinforcement learning framework for context-aware VD. To support training and evaluation, we first construct ContextVul, a new dataset that augments high-quality function-level samples with repository-level context information extracted by a lightweight method. We then design a multi-dimensional reward structure that jointly captures prediction correctness, vulnerability localization accuracy, and the semantic relevance of the vulnerability analysis, thereby guiding the model toward comprehensive contextual reasoning. To address the asymmetric difficulty of different vulnerability cases and to mitigate reward hacking, VULPO incorporates label-level and sample-level difficulty-adaptive reward scaling, which encourages the model to explore challenging cases while maintaining a balanced reward distribution. Extensive experiments demonstrate the superiority of the VULPO framework in context-aware VD: VULPO-4B substantially outperforms existing VD baselines based on prompt engineering and off-policy optimization, improving F1 by 85% over Qwen3-4B and achieving performance comparable to DeepSeek-R1-0528, a model 150x larger.
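
To make the reward design above more concrete, the following is a minimal illustrative sketch, not the paper's actual formulation. It assumes that the three reward dimensions are each normalized to [0, 1] and combined by a weighted sum, and that difficulty-adaptive scaling multiplies the result by a sample-level factor (larger when few rollouts for that sample are correct) and a label-level factor. The names `RewardParts`, `combined_reward`, `difficulty_scale`, the weights, the `label_boost` value, and the choice to boost the vulnerable label are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RewardParts:
    """Per-response reward components (all names and ranges are assumptions)."""
    correctness: float   # 1.0 if the predicted label matches the ground truth, else 0.0
    localization: float  # agreement with the true vulnerable location, in [0, 1]
    relevance: float     # semantic relevance of the written analysis, in [0, 1]


def combined_reward(parts: RewardParts, weights=(0.5, 0.25, 0.25)) -> float:
    """Weighted sum of the three dimensions; the weights are illustrative only."""
    w_c, w_l, w_r = weights
    return w_c * parts.correctness + w_l * parts.localization + w_r * parts.relevance


def difficulty_scale(group_accuracy: float, label_is_vulnerable: bool,
                     label_boost: float = 1.5) -> float:
    """Hypothetical difficulty-adaptive multiplier.

    Sample level: the lower the fraction of correct rollouts for this sample,
    the larger the factor (between 1 and 2).
    Label level: a fixed boost for the label assumed to be harder.
    """
    sample_factor = 1.0 + (1.0 - group_accuracy)
    label_factor = label_boost if label_is_vulnerable else 1.0
    return sample_factor * label_factor


def scaled_reward(parts: RewardParts, group_accuracy: float,
                  label_is_vulnerable: bool) -> float:
    """Combined reward after difficulty-adaptive scaling."""
    return combined_reward(parts) * difficulty_scale(group_accuracy, label_is_vulnerable)


if __name__ == "__main__":
    parts = RewardParts(correctness=1.0, localization=0.5, relevance=0.5)
    # A hard vulnerable sample: only 25% of its rollouts were correct.
    print(scaled_reward(parts, group_accuracy=0.25, label_is_vulnerable=True))
```

In this sketch, a correct answer on a sample that most rollouts get wrong earns a larger reward than the same answer on an easy sample, which is one simple way to encourage exploration of challenging cases. How VULPO actually balances the reward distribution and guards against reward hacking is not specified in the abstract.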
