Detecting and Grounding Multi-Modal Media Manipulation

Source: arXiv, CVPR 2023. Project page: https://rshaojimmy.github.io/Projects/MultiModal-DeepFake Code: https://github.com/rshaojimmy/MultiModal-DeepFake

Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and text fake news detection methods have been proposed, they are only designed for single-modality forgery based on binary classification, let alone analyzing and reasoning subtle forgery traces across different modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM^4). DGM^4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content (i.e., image bounding boxes and text tokens), which requires deeper reasoning of multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM^4 dataset, where image-text pairs are manipulated by various approaches, with rich annotation of diverse manipulations. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of our model; several valuable observations are also revealed to facilitate future research in multi-modal media manipulation.

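In short: fake media in both visual and textual form is widespread online, yet existing deepfake detectors and text fake-news detectors each handle a single modality and reduce the problem to binary classification, so they cannot analyze or reason about subtle forgery traces that span modalities. The paper defines a new research problem, DGM^4, which asks a model both to decide whether a multi-modal item is authentic and to ground the manipulated content, i.e., image bounding boxes and text tokens. To support large-scale study, the authors build the first DGM^4 dataset, in which image-text pairs are manipulated by various approaches and richly annotated with the manipulations applied. Their model, HAMMER, reasons at two levels: 1) shallow manipulation reasoning, through manipulation-aware contrastive learning between the two uni-modal encoders, and 2) deep manipulation reasoning, through modality-aware cross-attention in a multi-modal aggregator. Dedicated manipulation detection and grounding heads are integrated from the shallow to the deep level on top of the interacted multi-modal information. Finally, the authors build a benchmark with evaluation metrics for the new problem, report experiments showing the superiority of their model, and share several observations intended to help future research on multi-modal media manipulation.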
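
To make the "shallow manipulation reasoning" step more concrete, here is a minimal sketch of what a manipulation-aware contrastive loss could look like. It is not the authors' implementation: the abstract does not specify the loss, so the InfoNCE form, the image-to-text direction, the rule that only pristine pairs act as positives, the temperature value, and every name below (`manipulation_aware_contrastive_loss`, `is_pristine`, and so on) are assumptions made for illustration. See the official code repository for the actual method.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def manipulation_aware_contrastive_loss(image_embs, text_embs, is_pristine,
                                        temperature=0.07):
    """Image-to-text InfoNCE loss over a batch of image-text pairs.

    Assumption (not taken from the paper): only pristine pairs are used as
    anchors/positives; the texts of manipulated pairs appear only as
    negatives for the other anchors.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]

    losses = []
    for i, img in enumerate(imgs):
        if not is_pristine[i]:
            continue  # manipulated pairs contribute no positive term
        logits = [_dot(img, t) / temperature for t in txts]
        # Numerically stable log-sum-exp over all texts in the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])

    # With no pristine pair in the batch there is nothing to pull together.
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for the two uni-modal encoder outputs.
    images = [[1.0, 0.0], [0.0, 1.0]]
    matching_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    flags = [True, True]
    print(manipulation_aware_contrastive_loss(images, matching_texts, flags))
    print(manipulation_aware_contrastive_loss(images, swapped_texts, flags))
```

In this toy setup the loss is small when each image embedding lines up with its own text and large when the texts are swapped, which is the behavior a contrastive objective is meant to reward. A real training setup would compute this on encoder outputs in a tensor library and would likely add the symmetric text-to-image term.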