The rapid advancement of deep learning models that can generate and synthesize hyper-realistic videos, known as Deepfakes, and their easy accessibility to the general public have raised concerns among all affected parties about their potential malicious use. Deep learning techniques can now generate faces, swap faces between two subjects in a video, alter facial expressions, change gender, and modify facial features, to name a few. These powerful video-manipulation methods have potential uses in many fields. However, they also pose a looming threat to everyone if used for harmful purposes such as identity theft, phishing, and scams. In this work, we propose a Convolutional Vision Transformer for the detection of Deepfakes. The Convolutional Vision Transformer has two components: a Convolutional Neural Network (CNN) and a Vision Transformer (ViT). The CNN extracts learnable features, while the ViT takes the learned features as input and categorizes them using an attention mechanism. We trained our model on the DeepFake Detection Challenge (DFDC) dataset and achieved 91.5 percent accuracy, an AUC value of 0.91, and a loss value of 0.32. Our contribution is that we added a CNN module to the ViT architecture and achieved a competitive result on the DFDC dataset.
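The CNN-into-ViT pipeline described above can be illustrated with a minimal NumPy sketch: a convolutional stage produces a feature map, the map is split into patch tokens, and a single scaled dot-product attention block aggregates the tokens before a linear scoring head. All shapes, layer sizes, and initializations here are illustrative assumptions, not the authors' actual DFDC configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(x, kernel):
    """Naive 2-D 'valid' convolution: x is (H, W), kernel is (k, k)."""
    k = kernel.shape[0]
    out = np.empty((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * kernel)
    return out

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of feature tokens."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return weights @ V, weights

# Toy "frame": one 16x16 grayscale image (a stand-in for a face crop).
frame = rng.standard_normal((16, 16))

# CNN stage: one random 3x3 filter stands in for the learned extractor.
features = conv2d_valid(frame, rng.standard_normal((3, 3)))  # (14, 14)

# Split the feature map into four 7x7 patches, flattened into tokens.
tokens = np.stack([features[i:i + 7, j:j + 7].ravel()
                   for i in (0, 7) for j in (0, 7)])          # (4, 49)

# ViT stage: one attention block plus a linear "real vs. fake" head.
d = tokens.shape[1]
scale = 1.0 / np.sqrt(d)  # small init keeps logits in a sane range
attended, weights = self_attention(
    tokens,
    rng.standard_normal((d, d)) * scale,
    rng.standard_normal((d, d)) * scale,
    rng.standard_normal((d, d)) * scale)
logit = attended.mean(axis=0) @ (rng.standard_normal(d) * scale)
prob_fake = 1.0 / (1.0 + np.exp(-logit))  # sigmoid score in (0, 1)

print(tokens.shape, weights.shape)
```

In the actual model the convolutional stage is a deep learned feature extractor and the transformer has multiple heads and layers; this sketch only shows how the two components connect, with attention weights forming a row-stochastic matrix over the patch tokens.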