One of the most pressing challenges for the detection of face-manipulated videos is generalising to forgery methods not seen during training while remaining effective under common corruptions such as compression. In this paper, we examine whether we can tackle this issue by harnessing videos of real talking faces, which contain rich information on natural facial appearance and behaviour and are readily available in large quantities online. Our method, termed RealForensics, consists of two stages. First, we exploit the natural correspondence between the visual and auditory modalities in real videos to learn, in a self-supervised cross-modal manner, temporally dense video representations that capture factors such as facial movements, expression, and identity. Second, we use these learned representations as targets to be predicted by our forgery detector along with the usual binary forgery classification task; this encourages it to base its real/fake decision on said factors. We show that our method achieves state-of-the-art performance on cross-manipulation generalisation and robustness experiments, and examine the factors that contribute to its performance. Our results suggest that leveraging natural and unlabelled videos is a promising direction for the development of more robust face forgery detectors.
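The second stage described above combines the usual binary real/fake objective with an auxiliary task of matching the representations learned in stage one. The following is a minimal sketch of such a combined loss; the function names, the cosine-distance auxiliary term, and the weighting factor `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def stage2_loss(logits, labels, pred_feats, ssl_targets, alpha=1.0):
    """Combined stage-2 objective (illustrative sketch, not the paper's exact loss).

    logits      : (B,) real/fake logits from the forgery detector
    labels      : (B,) binary labels (1 = fake, 0 = real)
    pred_feats  : (B, D) detector's predictions of the self-supervised targets
    ssl_targets : (B, D) frozen representations from the stage-1 cross-modal model
    """
    # Numerically stable binary cross-entropy on the real/fake logits.
    cls = np.mean(
        np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    )
    # Auxiliary task: cosine distance between predicted and frozen stage-1
    # representations, encouraging decisions based on natural facial factors.
    p = pred_feats / np.linalg.norm(pred_feats, axis=-1, keepdims=True)
    t = ssl_targets / np.linalg.norm(ssl_targets, axis=-1, keepdims=True)
    aux = 1.0 - np.mean(np.sum(p * t, axis=-1))
    return cls + alpha * aux
```

In this sketch the stage-1 network is frozen and only supplies targets, so gradients flow solely through the detector, which is what lets the auxiliary term regularise the real/fake decision.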