Large Language Models (LLMs) are transforming software engineering tasks, including code vulnerability detection, a critical area of software security. However, existing methods often rely on resource-intensive models or graph-based techniques, limiting their accessibility and practicality. This paper introduces K-ASTRO, a lightweight Transformer model that combines semantic embeddings from LLMs with structural features of Abstract Syntax Trees (ASTs) to improve both the efficiency and accuracy of code vulnerability detection. Our approach introduces an AST-based augmentation technique inspired by mutation testing, a structure-aware attention mechanism that incorporates the augmented AST features, and a joint adaptation pipeline that unifies code semantics and syntax. Experimental results on three large-scale datasets (BigVul, DiverseVul, and PrimeVul) demonstrate state-of-the-art performance while enabling rapid inference on CPUs with minimal training time. By offering a scalable, interpretable, and efficient solution, K-ASTRO bridges the gap between LLM advancements and practical software vulnerability detection, and we open-source our tools to foster further research.