Vision Language Models (VLMs) are increasingly adopted for AI-generated image (AIGI) detection, yet converting VLMs into detectors requires substantial resources, while the resulting models still exhibit severe hallucinations. To probe the core issue, we conduct an empirical analysis and observe two characteristic behaviors: (i) fine-tuning VLMs on high-level semantic supervision strengthens semantic discrimination and generalizes well to unseen data; (ii) fine-tuning VLMs on low-level pixel-artifact supervision yields poor transfer. We attribute VLMs' underperformance to task-model misalignment: semantics-oriented VLMs inherently lack sensitivity to fine-grained pixel artifacts, so semantically non-discriminative pixel artifacts fall outside their inductive biases. In contrast, we observe that conventional pixel-artifact detectors capture low-level pixel artifacts yet exhibit limited semantic awareness relative to VLMs, highlighting that distinct models are better matched to distinct tasks. In this paper, we formalize AIGI detection as two complementary tasks--semantic consistency checking and pixel-artifact detection--and show that neglecting either induces systematic blind spots. Guided by this view, we introduce the Task-Model Alignment principle and instantiate it as a two-branch detector, AlignGemini, comprising a VLM fine-tuned exclusively with pure semantic supervision and a pixel-artifact expert trained exclusively with pure pixel-artifact supervision. By enforcing orthogonal supervision on two simplified datasets, each branch trains to its strengths, producing complementary discrimination over semantic and pixel cues. On five in-the-wild benchmarks, AlignGemini delivers a +9.5% gain in average accuracy, supporting task-model alignment as an effective path to generalizable AIGI detection.
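To make the two-branch design concrete, the following is a minimal sketch of how such a detector could combine the two branches at inference time. It is illustrative only: the class and module names (`TwoBranchDetector`, `semantic_branch`, `pixel_branch`) are hypothetical placeholders, and the probabilistic-OR fusion shown is one plausible combination rule; the abstract does not specify AlignGemini's actual backbones or fusion.

```python
# Hypothetical sketch of a two-branch AIGI detector under the task-model
# alignment view. Names and the fusion rule are illustrative assumptions,
# not the paper's specified implementation.
import torch
import torch.nn as nn


class TwoBranchDetector(nn.Module):
    def __init__(self, semantic_branch: nn.Module, pixel_branch: nn.Module):
        super().__init__()
        # Branch 1: a VLM fine-tuned only with semantic supervision,
        # scoring semantic consistency.
        self.semantic_branch = semantic_branch
        # Branch 2: an expert trained only with pixel-artifact supervision,
        # scoring low-level generation artifacts.
        self.pixel_branch = pixel_branch

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Each branch emits one "fake" logit for its own cue type.
        s = self.semantic_branch(image)  # semantic-consistency logit
        p = self.pixel_branch(image)     # pixel-artifact logit
        # Probabilistic OR: flag the image as AI-generated if either
        # cue fires, so the branches stay complementary.
        return 1.0 - (1.0 - torch.sigmoid(s)) * (1.0 - torch.sigmoid(p))


# Usage with stand-in branches (each maps an image batch to one logit):
sem = nn.Sequential(nn.Flatten(), nn.LazyLinear(1))
pix = nn.Sequential(nn.Flatten(), nn.LazyLinear(1))
detector = TwoBranchDetector(sem, pix)
prob_fake = detector(torch.randn(4, 3, 224, 224))  # shape: (4, 1)
```

The OR-style fusion reflects the abstract's claim that neglecting either task induces blind spots: an image is accepted as real only when it passes both the semantic-consistency check and the pixel-artifact check.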