The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods face issues such as a lack of transparency in their principles and insufficient generalization to cope with evolving forgery techniques. This highlights an urgent need for detectors that can identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal large language model (MLLM) reasoning framework that provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates a Spatio-Temporal Subtle Information Tokenization (ST-SIT) module to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery-method and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The project page is available at: https://11ouo1.github.io/edvd-llama/.
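To make the ST-SIT idea concrete, the sketch below illustrates one plausible reading of "extract and fuse global and local cross-frame features": per frame, coarse patch statistics over the whole image serve as global tokens, finer statistics over a face crop serve as local tokens, and the two are concatenated into one token sequence across time. The function name, patch-mean tokenization, and fusion-by-concatenation are all illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def st_sit_tokens(frames, face_box, patch=8):
    """Illustrative sketch (not the paper's ST-SIT module).

    frames: (T, H, W) grayscale video array
    face_box: (y0, y1, x0, x1) face region, assumed given by a detector
    Returns a (T, num_tokens) array of fused global + local tokens.
    """
    T, H, W = frames.shape
    y0, y1, x0, x1 = face_box
    out = []
    for t in range(T):
        frame = frames[t]
        # Global tokens: mean intensity of coarse patches over the full frame.
        g = frame[:H - H % patch, :W - W % patch]
        g = g.reshape(H // patch, patch, W // patch, patch).mean(axis=(1, 3)).ravel()
        # Local tokens: the same statistic restricted to the face crop,
        # standing in for the "subtle" local forgery cues.
        crop = frame[y0:y1, x0:x1]
        h, w = crop.shape
        l = crop[:h - h % patch, :w - w % patch]
        l = l.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3)).ravel()
        # Fuse global and local features into one token vector per frame.
        out.append(np.concatenate([g, l]))
    return np.stack(out)
```

A real implementation would replace the patch means with learned vision-backbone embeddings and feed the resulting cross-frame token sequence to the MLLM; this sketch only shows the global/local split and the fusion step.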