Manipulated videos often contain subtle inconsistencies between their visual and audio signals. We propose a video forensics method, based on anomaly detection, that can identify these inconsistencies, and that can be trained solely using real, unlabeled data. We train an autoregressive model to generate sequences of audio-visual features, using feature sets that capture the temporal synchronization between video frames and sound. At test time, we then flag videos to which the model assigns low probability. Despite being trained entirely on real videos, our model obtains strong performance on the task of detecting manipulated speech videos. Project site: https://cfeng16.github.io/audio-visual-forensics
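To make the test-time scoring concrete, below is a minimal sketch of the anomaly-detection step: an autoregressive model scores a sequence of audio-visual features, and videos with a high average negative log-likelihood (low probability) are flagged. All names here (SyncFeatureLM, the token vocabulary, the threshold) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: score a video by the likelihood an autoregressive
# model assigns to its sequence of audio-visual synchronization features.
import torch
import torch.nn as nn

class SyncFeatureLM(nn.Module):
    """Toy autoregressive Transformer over a sequence of discretized
    audio-visual sync tokens (hypothetical vocabulary)."""
    def __init__(self, vocab_size=512, d_model=128, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask: each step attends only to past features.
        T = tokens.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.head(h)

@torch.no_grad()
def anomaly_score(model, tokens):
    """Average negative log-likelihood of the feature sequence;
    a higher score suggests audio-visual inconsistency."""
    logits = model(tokens[:, :-1])        # predict each next token
    logp = logits.log_softmax(-1)
    nll = -logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
    return nll.mean(dim=1)                # one score per video

model = SyncFeatureLM().eval()            # assume trained on real videos only
tokens = torch.randint(0, 512, (1, 64))   # stand-in feature tokens
score = anomaly_score(model, tokens)
is_fake = score > 5.0                     # threshold calibrated on real data
```

Because the model is fit only to real videos, no manipulated examples or labels are needed; any feature sequence the model finds improbable is treated as anomalous.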