The widespread dissemination of Deepfakes demands effective approaches that can detect perceptually convincing forged images. In this paper, we aim to capture the subtle manipulation artifacts at different scales using transformer models. In particular, we introduce a Multi-modal Multi-scale TRansformer (M2TR), which operates on patches of different sizes to detect local inconsistencies in images at different spatial levels. M2TR further learns to detect forgery artifacts in the frequency domain to complement RGB information through a carefully designed cross-modality fusion block. In addition, to stimulate Deepfake detection research, we introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 Deepfake videos generated by state-of-the-art face swapping and facial reenactment methods. We conduct extensive experiments to verify the effectiveness of the proposed method, which outperforms state-of-the-art Deepfake detection methods by clear margins.
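To make the two ideas named above concrete, the sketch below illustrates (1) transformer attention over patch tokens of several sizes and (2) a frequency-domain branch fused with RGB features. This is a minimal illustration, not the M2TR architecture itself: the module names, dimensions, and the simple concatenation-based fusion are assumptions for exposition only.

```python
# Minimal sketch (assumed design, not the authors' implementation): multi-scale
# patch transformers over RGB plus a frequency branch, fused for real/fake logits.
import torch
import torch.nn as nn


class MultiScaleBranch(nn.Module):
    """Embeds an image into patch tokens at one spatial scale and applies self-attention."""

    def __init__(self, in_ch=3, dim=64, patch=8, img=64):
        super().__init__()
        # Non-overlapping patch embedding at this scale.
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        num_tokens = (img // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        return self.encoder(tokens + self.pos).mean(dim=1)   # (B, dim) pooled per scale


class FrequencyBranch(nn.Module):
    """Extracts a coarse descriptor from the log-magnitude FFT spectrum."""

    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv2d(3, dim, 3, stride=2, padding=1),
                                  nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1))

    def forward(self, x):
        mag = torch.fft.fft2(x).abs().log1p()    # per-channel spectrum magnitude
        return self.proj(mag).flatten(1)         # (B, dim)


class DeepfakeDetectorSketch(nn.Module):
    """Fuses multi-scale RGB tokens with frequency features for a binary real/fake output."""

    def __init__(self, patch_sizes=(4, 8, 16), dim=64, img=64):
        super().__init__()
        self.branches = nn.ModuleList(
            MultiScaleBranch(dim=dim, patch=p, img=img) for p in patch_sizes)
        self.freq = FrequencyBranch(dim=dim)
        self.head = nn.Linear(dim * (len(patch_sizes) + 1), 2)

    def forward(self, x):
        feats = [b(x) for b in self.branches] + [self.freq(x)]
        return self.head(torch.cat(feats, dim=1))   # logits: real vs. fake


if __name__ == "__main__":
    model = DeepfakeDetectorSketch()
    logits = model(torch.randn(2, 3, 64, 64))   # two 64x64 RGB face crops
    print(logits.shape)                         # torch.Size([2, 2])
```

Concatenating pooled features is only one possible fusion; the paper's cross-modality fusion block is a dedicated design whose details are not reproduced here.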