WebShell attacks, in which malicious scripts are injected into web servers, pose a significant cybersecurity threat. Traditional machine learning (ML) and deep learning (DL) methods are often hampered by challenges such as the need for extensive training data, catastrophic forgetting, and poor generalization. Recently, Large Language Models (LLMs) have emerged as powerful alternatives for code-related tasks, but their potential for WebShell detection remains underexplored. In this paper, we make two contributions: (1) a comprehensive evaluation of seven LLMs, including GPT-4, LLaMA 3.1 70B, and Qwen 2.5 variants, benchmarked against traditional sequence- and graph-based methods on a dataset of 26.59K PHP scripts, and (2) the Behavioral Function-Aware Detection (BFAD) framework, designed to address the specific challenges of applying LLMs to this domain. Our framework integrates three components: a Critical Function Filter that isolates malicious PHP function calls, a Context-Aware Code Extraction strategy that captures the most behaviorally indicative code segments, and Weighted Behavioral Function Profiling, which enhances in-context learning by prioritizing the most relevant demonstrations based on discriminative function-level profiles. Our results show that, owing to their distinct analytical strategies, larger LLMs achieve near-perfect precision but lower recall, while smaller models exhibit the opposite trade-off; however, all LLM baselines lag behind prior state-of-the-art (SOTA) methods. With BFAD applied, the performance of all LLMs improves significantly, yielding an average F1 score increase of 13.82%. Notably, larger models now outperform SOTA benchmarks, while smaller models such as Qwen-2.5-Coder-3B achieve performance competitive with traditional methods. This work is the first to explore the feasibility and limitations of LLMs for WebShell detection and provides solutions to address the challenges in this task.
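To make the Critical Function Filter idea concrete, the sketch below shows one plausible way such a filter might flag PHP source that calls commonly abused functions. The function list, regex matching, and helper name here are illustrative assumptions, not the paper's actual implementation.

```python
import re

# Assumed (illustrative) set of PHP functions frequently abused in WebShells;
# the paper's actual critical-function list may differ.
CRITICAL_FUNCTIONS = {
    "eval", "assert", "system", "exec", "shell_exec",
    "passthru", "popen", "proc_open", "base64_decode",
    "create_function", "preg_replace",  # preg_replace with the /e modifier was eval-like
}

def critical_function_hits(php_source: str) -> dict:
    """Count occurrences of critical function calls in a PHP source string."""
    hits = {}
    for name in CRITICAL_FUNCTIONS:
        # Match bare calls like `eval(`; PHP function names are case-insensitive.
        pattern = rf"\b{name}\s*\("
        count = len(re.findall(pattern, php_source, flags=re.IGNORECASE))
        if count:
            hits[name] = count
    return hits

if __name__ == "__main__":
    sample = '<?php eval(base64_decode($_POST["cmd"])); ?>'
    print(critical_function_hits(sample))  # e.g. {'eval': 1, 'base64_decode': 1}
```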