The threat of audio-video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks remain confined to deepfake-based forgeries and single-granularity annotations, and therefore fail to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human and general subjects. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotation. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery type classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Multimodal Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.