The adoption of Large Language Models (LLMs) for code generation risks incorporating vulnerable code into software systems. Existing detectors face two critical limitations: a lack of systematic cross-model validation and opaque "black box" operation. We address this through a comparative study of code generated by four distinct LLMs: GPT-3.5, Claude 3 Haiku, Claude Haiku 4.5, and GPT-OSS. Starting from 14,485 Python functions and 11,913 classes in the CodeSearchNet dataset, we generated corresponding code with each of the four LLMs and trained CatBoost classifiers on interpretable software metrics for each configuration. Our analysis reveals that granularity effects dominate model differences by a factor of 8.6, with negligible feature overlap, indicating that function-level and class-level detection rely on fundamentally disjoint structural signatures. We discover critical granularity-dependent inversions: while modern models (Claude, GPT-OSS) are more detectable at the class level, GPT-3.5 is an outlier whose output is uniquely detectable at the function level. SHAP analysis identifies the Comment-to-Code Ratio as the sole universal discriminator, yet its predictive magnitude varies drastically across models, explaining why detectors trained on one LLM fail to generalize to others. Our findings demonstrate that GPT-3.5's exceptional detectability (AUC-ROC 0.96) is unrepresentative of contemporary models (AUC-ROC approximately 0.68 to 0.80). Robust detection requires moving beyond single-model studies to account for substantial diversity in structural fingerprints across architectures and granularities.
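The detection pipeline named in the abstract (a CatBoost classifier over interpretable software metrics, inspected with SHAP) can be illustrated with a minimal sketch. This is not the authors' exact pipeline: the data is synthetic, the labels are random, and the metric names (beyond the Comment-to-Code Ratio mentioned above) are placeholders standing in for the paper's full feature set.

```python
# Minimal sketch, assuming a tabular feature set of per-function software
# metrics labeled 1 = LLM-generated, 0 = human-written. Data, labels, and
# most feature names here are hypothetical stand-ins.
import numpy as np
import pandas as pd
import shap
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "comment_to_code_ratio": rng.beta(2, 5, n),   # the one metric named in the abstract
    "cyclomatic_complexity": rng.poisson(4, n),   # illustrative placeholder metric
    "lines_of_code": rng.integers(3, 200, n),     # illustrative placeholder metric
})
y = rng.integers(0, 2, n)  # synthetic labels; a real study uses paired human/LLM samples

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Train one CatBoost classifier per configuration (here, a single one).
model = CatBoostClassifier(iterations=300, depth=6, verbose=0, random_seed=0)
model.fit(X_tr, y_tr)
print("AUC-ROC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# SHAP attributes each prediction to individual metrics, mirroring the
# interpretability analysis that surfaces the Comment-to-Code Ratio.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)
mean_abs = np.abs(shap_values).mean(axis=0)
for name, value in sorted(zip(X.columns, mean_abs), key=lambda t: -t[1]):
    print(f"{name}: {value:.4f}")
```

With random labels the AUC hovers near 0.5; the point of the sketch is the structure of the pipeline, in which ranking features by mean absolute SHAP value is what allows a discriminator's predictive magnitude to be compared across models and granularities.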