Deepfakes are the result of digital manipulation techniques that produce credible fake videos intended to deceive the viewer. They are created with deep learning methods based on autoencoders or GANs, which become more accessible and accurate year after year, yielding fake videos that are very difficult to distinguish from real ones. Deepfake detection has traditionally relied on convolutional neural networks (CNNs), with the best results obtained by methods based on EfficientNet B7. In this study, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor, obtaining results comparable to some very recent methods that use Vision Transformers. Unlike state-of-the-art approaches, we use neither distillation nor ensemble methods. Our best model achieves an AUC of 0.951 and an F1 score of 88.0%, very close to the state of the art on the DeepFake Detection Challenge (DFDC).
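The following is a minimal sketch (not the authors' released code) of the hybrid design the abstract describes: an EfficientNet B0 backbone used as a convolutional feature extractor whose spatial feature map is flattened into tokens for a small Transformer encoder, with a [CLS] token classified as real or fake. The embedding dimension follows EfficientNet B0's 1280-channel output; the depth, head count, and input resolution are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of an EfficientNet-B0 + Vision Transformer hybrid for
# deepfake detection; hyperparameters are assumptions, not the paper's.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0


class EfficientViT(nn.Module):
    def __init__(self, depth: int = 4, num_heads: int = 8):
        super().__init__()
        # Convolutional backbone: EfficientNet-B0 without its classifier head.
        # For a 224x224 input it produces a (B, 1280, 7, 7) feature map.
        self.backbone = efficientnet_b0(weights="IMAGENET1K_V1").features
        embed_dim = 1280
        # Learnable [CLS] token and positional embeddings for 49 + 1 tokens.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 50, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, 1)  # single real/fake logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                   # (B, 1280, 7, 7)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, 49, 1280)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])             # classify from [CLS]


model = EfficientViT()
logit = model(torch.randn(2, 3, 224, 224))  # (2, 1) fake-vs-real logits
```

Treating each spatial position of the CNN feature map as one token lets the Transformer attend across face regions while the convolutional stem supplies local texture features, which is the general idea behind pairing an EfficientNet extractor with a Vision Transformer.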