Deepfake media is becoming widespread because easily available tools and mobile apps can generate realistic-looking deepfake videos and images without requiring any technical knowledge. With further advances in this technology in the near future, both the quantity and the quality of deepfake media are expected to increase, making deepfakes a practical new tool for spreading mis/disinformation. Because of these concerns, deepfake detection tools are becoming a necessity. In this study, we propose a novel hybrid transformer network that uses an early feature fusion strategy for deepfake video detection. Our model employs two different CNNs, (1) XceptionNet and (2) EfficientNet-B4, as feature extractors. We train both feature extractors together with the transformer in an end-to-end manner on the FaceForensics++ and DFDC benchmarks. Despite its relatively straightforward architecture, our model achieves results comparable to those of more advanced state-of-the-art approaches when evaluated on FaceForensics++ and DFDC. We also propose novel face cut-out augmentations as well as random cut-out augmentations, and show that they improve the detection performance of our model and reduce overfitting. In addition, we show that our model is capable of learning from a considerably small amount of data.
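To make the random cut-out idea concrete, the following is a minimal sketch of a generic random cut-out augmentation: one randomly placed rectangle of a frame is erased. It is not taken from the study itself. The function name `random_cutout`, the `max_frac` parameter, the single-rectangle policy, and the zero fill value are all assumptions made for illustration, and the face cut-out variant, which would depend on face-region information, is not represented here.

```python
import numpy as np


def random_cutout(image, max_frac=0.3, rng=None):
    """Return a copy of `image` with one random rectangle set to zero.

    Hypothetical helper: `max_frac` (largest rectangle side as a fraction
    of the image side) and the zero fill are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]

    # Rectangle size: at least 1 pixel, at most max_frac of each side.
    cut_h = int(rng.integers(1, max(2, int(h * max_frac) + 1)))
    cut_w = int(rng.integers(1, max(2, int(w * max_frac) + 1)))

    # Top-left corner chosen so the rectangle stays inside the image.
    top = int(rng.integers(0, h - cut_h + 1))
    left = int(rng.integers(0, w - cut_w + 1))

    out = image.copy()  # leave the input frame untouched
    out[top:top + cut_h, left:left + cut_w] = 0
    return out


if __name__ == "__main__":
    frame = np.full((64, 64, 3), 255, dtype=np.uint8)
    augmented = random_cutout(frame, max_frac=0.25,
                              rng=np.random.default_rng(0))
    erased = int((augmented == 0).all(axis=-1).sum())
    print("erased pixels:", erased)
```

In a training pipeline, a function of this kind would typically be applied to each face crop with some probability before the crop is passed to the feature extractors, so that the model cannot rely on any single image region.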