Fine-grained detection and localization of local image edits is crucial for assessing content authenticity, especially as modern diffusion models and image editors can produce highly realistic manipulations. However, this problem faces three key challenges: (1) most AIGC detectors produce only a global real-or-fake label without indicating where edits occur; (2) traditional computer vision methods for edit localization typically rely on costly pixel-level annotations; and (3) there is no large-scale, modern benchmark specifically targeting edited-image detection. To address these gaps, we develop an automated data-generation pipeline and construct FragFake, a large-scale benchmark of AI-edited images spanning multiple source datasets, diverse editing models, and several common edit types. Building on FragFake, we are the first to systematically study vision-language models (VLMs) for edited-image classification and edited-region localization. Our experiments show that pretrained VLMs, including GPT-4o, perform poorly on this task, whereas fine-tuned models such as Qwen2.5-VL achieve high accuracy and substantially higher object precision across all settings. We further explore GRPO-based RLVR training, which yields modest metric gains while improving the interpretability of model outputs. Ablation and transfer analyses reveal how data balancing, training-set size, LoRA rank, and training domain affect performance, and highlight both the potential and the limitations of cross-editor and cross-dataset generalization. We expect this work to provide a solid foundation for future research on multimodal content authenticity.