We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two, e.g., loss of lip-sync and unnatural facial and lip movements. MDS is computed as an aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for the individual modalities and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state of the art by up to 7%. We also demonstrate temporal forgery localization, showing how our technique identifies the manipulated video segments.
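To make the training objective concrete, the following is a minimal PyTorch sketch of a margin-based contrastive loss over per-chunk audio and visual embeddings, with the MDS taken as the mean per-chunk distance. The function names, margin value, embedding dimension, and decision threshold are illustrative assumptions rather than the authors' released implementation; the per-modality cross-entropy terms are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_feat, video_feat, is_real, margin=0.99):
    """Per-chunk contrastive loss: pull genuine audio-visual pairs together,
    push manipulated pairs at least `margin` apart. `is_real` is 1 for a
    genuine video and 0 for a manipulated one (an assumed label convention)."""
    d = F.pairwise_distance(audio_feat, video_feat)        # (num_chunks,)
    loss_real = is_real * d.pow(2)                         # real: small distance
    loss_fake = (1 - is_real) * F.relu(margin - d).pow(2)  # fake: distance >= margin
    return (loss_real + loss_fake).mean()

def modality_dissonance_score(audio_feats, video_feats):
    """MDS: aggregate the per-chunk audio-visual dissimilarity over the video."""
    return F.pairwise_distance(audio_feats, video_feats).mean()

# Toy usage: 10 one-second chunks with 128-d embeddings per modality.
audio = torch.randn(10, 128)
video = torch.randn(10, 128)
mds = modality_dissonance_score(audio, video)
is_fake = mds > 0.5  # hypothetical threshold, to be tuned on validation data
```

At test time, a video is flagged as fake when its MDS exceeds a threshold, and the per-chunk distances themselves indicate which temporal segments are likely manipulated, which is what enables the forgery localization described above.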