Deepfakes have recently drawn considerable public attention due to security and privacy concerns in social media digital forensics. As the Deepfake videos spreading widely on the Internet become increasingly realistic, traditional detection techniques fail to distinguish real from fake. Most existing deep learning methods focus mainly on local features and relations within the face image, using convolutional neural networks as the backbone. However, local features and relations alone are insufficient for a model to learn the general information needed for Deepfake detection. As a result, existing Deepfake detection methods have reached a bottleneck in further improving detection performance. To address this issue, we propose a deep convolutional Transformer that incorporates decisive image features both locally and globally. Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy. Moreover, we employ the rarely discussed image keyframes in model training for performance improvement, and we visualize the feature-quantity gap between keyframes and normal frames caused by video compression. Finally, we demonstrate transferability through extensive experiments on several Deepfake benchmark datasets. The proposed solution consistently outperforms several state-of-the-art baselines in both within- and cross-dataset experiments.