Deep Convolutional Pooling Transformer for Deepfake Detection

Deepfake has recently drawn considerable public attention because of the security and privacy concerns it raises in social media digital forensics. As the Deepfake videos spreading wildly across the Internet become more realistic, traditional detection techniques fail to distinguish real from fake. Most existing deep learning methods focus on local features and relations within the face image, using convolutional neural networks as the backbone. However, local features and relations alone do not let a model learn information general enough for Deepfake detection, so existing methods have reached a bottleneck in detection performance. To address this issue, we propose a deep convolutional Transformer that incorporates decisive image features both locally and globally. Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy. Moreover, we employ image keyframes, which are rarely discussed in this context, in model training to improve performance, and we visualize the gap in feature quantity between key and normal image frames caused by video compression. Finally, we demonstrate transferability through extensive experiments on several Deepfake benchmark datasets. The proposed solution consistently outperforms several state-of-the-art baselines in both within-dataset and cross-dataset experiments.

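The abstract names re-attention but does not give its formulation. The following is a minimal sketch, assuming the term refers to mixing the per-head attention maps with a learnable head-to-head matrix before they are applied to the values, as in earlier deep vision Transformer work. The function names, the tensor shapes, and the matrix `theta` are illustrative assumptions, not the authors' implementation. The normalization step that usually follows the mixing is omitted for brevity, and convolutional pooling is not sketched here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def re_attention(q, k, v, theta):
    """Illustrative re-attention (assumed formulation, not taken from the paper).

    q, k, v: arrays of shape (heads, tokens, dim)
    theta:   (heads, heads) mixing matrix; learnable in a real model
    """
    d = q.shape[-1]
    # Standard scaled dot-product attention maps, one per head: (H, N, N)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d), axis=-1)
    # Mix the attention maps across heads: mixed[g] = sum_h theta[h, g] * attn[h]
    mixed = np.einsum("hg,hnm->gnm", theta, attn)
    # Apply the mixed maps to the values: (H, N, dim)
    return mixed @ v
```

With `theta` set to the identity matrix, this reduces to ordinary multi-head attention, which gives a convenient sanity check for the sketch.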