One of the most pressing challenges for the detection of face-manipulated videos is generalising to forgery methods not seen during training while remaining effective under common corruptions such as compression. In this paper, we question whether we can tackle this issue by harnessing videos of real talking faces, which contain rich information on natural facial appearance and behaviour and are readily available in large quantities online. Our method, termed RealForensics, consists of two stages. First, we exploit the natural correspondence between the visual and auditory modalities in real videos to learn, in a self-supervised cross-modal manner, temporally dense video representations that capture factors such as facial movements, expression, and identity. Second, we use these learned representations as targets to be predicted by our forgery detector along with the usual binary forgery classification task; this encourages it to base its real/fake decision on said factors. We show that our method achieves state-of-the-art performance on cross-manipulation generalisation and robustness experiments, and examine the factors that contribute to its performance. Our results suggest that leveraging natural and unlabelled videos is a promising direction for the development of more robust face forgery detectors.
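The second stage described above combines the usual binary real/fake classification loss with an auxiliary loss that pushes the detector to predict the representations learned in stage one. A minimal numpy sketch of such a multi-task objective is given below; the function names, the negative-cosine form of the auxiliary term, and the weighting `lam` are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def bce_loss(logits, labels):
    # Binary cross-entropy on real/fake logits (the standard
    # forgery-classification term).
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-7
    return -np.mean(labels * np.log(p + eps)
                    + (1.0 - labels) * np.log(1.0 - p + eps))

def representation_loss(pred, target):
    # Auxiliary term (assumed form): negative cosine similarity
    # between the detector's predicted representations and the
    # frozen stage-one targets, averaged over batch and time.
    pred_n = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    tgt_n = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return -np.mean(np.sum(pred_n * tgt_n, axis=-1))

def total_loss(logits, labels, pred_repr, target_repr, lam=1.0):
    # Multi-task objective: classification plus representation
    # prediction; lam is an illustrative weighting hyperparameter.
    return bce_loss(logits, labels) + lam * representation_loss(
        pred_repr, target_repr)

# Toy usage: a batch of 2 clips, each with 4 time steps of
# 8-dimensional representations.
rng = np.random.default_rng(0)
logits = np.array([2.0, -2.0])
labels = np.array([1.0, 0.0])
reps = rng.normal(size=(2, 4, 8))
loss = total_loss(logits, labels, reps, reps, lam=1.0)
```

When the predicted representations exactly match the targets, the cosine term contributes its minimum of -1, so the total reduces to the classification loss minus `lam`; as predictions drift from the targets, the auxiliary term grows and regularises the detector toward the factors (movement, expression, identity) encoded in stage one.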