In today's era of digital misinformation, we are increasingly faced with new threats posed by video falsification techniques. Such falsifications range from cheapfakes (e.g., lookalikes or audio dubbing) to deepfakes (e.g., sophisticated AI media synthesis methods), which are becoming perceptually indistinguishable from real videos. To tackle this challenge, we propose a multi-modal semantic forensic approach to discover clues that go beyond detecting discrepancies in visual quality, thereby handling both simpler cheapfakes and visually persuasive deepfakes. In this work, our goal is to verify that the purported person seen in the video is indeed themselves by detecting anomalous correspondences between their facial movements and the words they are saying. We leverage the idea of attribution to learn person-specific biometric patterns that distinguish a given speaker from others. We use interpretable Action Units (AUs) to capture a person's face and head movement, as opposed to deep CNN visual features, and we are the first to use word-conditioned facial motion analysis. Unlike existing person-specific approaches, our method is also effective against attacks that focus on lip manipulation. We further demonstrate our method's effectiveness on a range of fakes not seen in training, including those without video manipulation, which were not addressed in prior work.
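To make the word-conditioned facial-motion idea concrete, the following is a minimal sketch of how per-word AU statistics could be built and compared against a person-specific reference profile. It assumes per-frame AU intensities (e.g., as produced by a toolkit such as OpenFace) and word-level timestamps (e.g., from a forced aligner) are already available; all function and variable names here are illustrative, not the paper's actual pipeline.

```python
# Sketch: word-conditioned AU features and a simple anomaly score.
# Assumptions (not from the paper): au_frames is a (T, K) array of K AU
# intensities over T frames; word_spans is a list of (word, start_s, end_s).
import numpy as np
from collections import defaultdict

def word_conditioned_au_profile(au_frames, word_spans, fps=30.0):
    """Aggregate AU statistics per spoken word.

    Returns a dict: word -> (mean, std) of AU intensities over all
    occurrences of that word in the clip.
    """
    segments = defaultdict(list)
    for word, start, end in word_spans:
        i, j = int(start * fps), int(end * fps) + 1
        seg = au_frames[i:j]
        if len(seg):
            segments[word].append(seg)
    profile = {}
    for word, segs in segments.items():
        stacked = np.concatenate(segs)          # all frames for this word
        profile[word] = (stacked.mean(axis=0), stacked.std(axis=0))
    return profile

def anomaly_score(reference, test):
    """Distance between a test clip's per-word AU means and a reference
    (real) speaker profile; a larger score suggests manipulation."""
    shared = set(reference) & set(test)
    if not shared:
        return float("nan")
    dists = [np.linalg.norm(reference[w][0] - test[w][0]) for w in shared]
    return float(np.mean(dists))
```

In this reading, authentication reduces to thresholding `anomaly_score` between a profile learned from verified footage of the speaker and the profile of a suspect clip; because the comparison is keyed on words rather than raw frames, it can flag lip manipulations and audio dubbing even when per-frame visual quality shows no artifacts.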