WebShell attacks, in which malicious scripts are injected into web servers, pose a significant cybersecurity threat. Traditional machine learning (ML) and deep learning (DL) methods are often hampered by challenges such as the need for extensive training data, catastrophic forgetting, and poor generalization. Recently, Large Language Models (LLMs) have emerged as powerful alternatives for code-related tasks, but their potential for WebShell detection remains underexplored. In this paper, we make two contributions: (1) a comprehensive evaluation of seven LLMs, including GPT-4, LLaMA 3.1 70B, and Qwen 2.5 variants, benchmarked against traditional sequence- and graph-based methods on a dataset of 26.59K PHP scripts, and (2) the Behavioral Function-Aware Detection (BFAD) framework, designed to address the specific challenges of applying LLMs to this domain. Our framework integrates three components: a Critical Function Filter that isolates malicious PHP function calls, a Context-Aware Code Extraction strategy that captures the most behaviorally indicative code segments, and Weighted Behavioral Function Profiling, which enhances in-context learning by prioritizing the most relevant demonstrations based on discriminative function-level profiles. Our results show that, owing to their distinct analytical strategies, larger LLMs achieve near-perfect precision but lower recall, while smaller models exhibit the opposite trade-off; however, all LLM baselines lag behind prior state-of-the-art (SOTA) methods. With BFAD applied, the performance of all LLMs improves significantly, yielding an average F1 score increase of 13.82%. Notably, larger models now outperform SOTA benchmarks, while smaller models such as Qwen-2.5-Coder-3B achieve performance competitive with traditional methods. This work is the first to explore the feasibility and limitations of LLMs for WebShell detection and provides solutions to address the challenges in this task.
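To make the Critical Function Filter idea concrete, the sketch below shows one plausible way such a filter might flag PHP source that calls commonly abused functions. The function list, regex matching, and helper name here are illustrative assumptions, not the paper's actual implementation.

```python
import re

# Assumed (illustrative) set of PHP functions frequently abused in WebShells;
# the paper's actual critical-function list may differ.
CRITICAL_FUNCTIONS = {
    "eval", "assert", "system", "exec", "shell_exec",
    "passthru", "popen", "proc_open", "base64_decode",
    "create_function", "preg_replace",  # preg_replace with the /e modifier was eval-like
}

def critical_function_hits(php_source: str) -> dict:
    """Count occurrences of critical function calls in a PHP source string."""
    hits = {}
    for name in CRITICAL_FUNCTIONS:
        # Match bare calls like `eval(`; PHP function names are case-insensitive.
        pattern = rf"\b{name}\s*\("
        count = len(re.findall(pattern, php_source, flags=re.IGNORECASE))
        if count:
            hits[name] = count
    return hits

if __name__ == "__main__":
    sample = '<?php eval(base64_decode($_POST["cmd"])); ?>'
    print(critical_function_hits(sample))  # e.g. {'eval': 1, 'base64_decode': 1}
```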