While the abuse of deepfake technology has caused serious concerns recently, detecting deepfake videos remains a challenge due to the highly photo-realistic synthesis of each frame. Existing image-level approaches often focus on a single frame and ignore the spatiotemporal cues hidden in deepfake videos, resulting in poor generalization and robustness. The key to a video-level detector is to fully exploit the spatiotemporal inconsistency distributed across local facial regions in different frames of deepfake videos. Inspired by this, this paper proposes a simple yet effective patch-level approach to facilitate deepfake video detection via a spatiotemporal dropout transformer. The approach reorganizes each input video into a bag of patches that is then fed into a vision transformer to achieve a robust representation. Specifically, a spatiotemporal dropout operation is proposed to fully explore patch-level spatiotemporal cues and serve as effective data augmentation to further enhance the model's robustness and generalization ability. The operation is flexible and can be easily plugged into existing vision transformers. Extensive experiments demonstrate the effectiveness of our approach against 25 state-of-the-art methods, with impressive robustness, generalizability, and representation ability.
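The core idea of the dropout operation can be illustrated with a minimal sketch: given a video reorganized into a space-time grid of patch tokens, whole patches are randomly zeroed out across both frames and spatial positions during training. The function name, tensor layout `(T, N, D)`, and drop rate below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def spatiotemporal_dropout(patches, drop_rate=0.3, seed=None):
    """Illustrative sketch (not the paper's exact operation):
    randomly zero out whole patch tokens over the space-time grid.

    patches: array of shape (T, N, D) -- T frames, N patches per
             frame, D embedding dimension.
    drop_rate: probability that a given (frame, patch) token is dropped.
    """
    rng = np.random.default_rng(seed)
    T, N, _ = patches.shape
    # One Bernoulli keep/drop decision per (frame, patch) position,
    # broadcast over the embedding dimension.
    keep = rng.random((T, N)) >= drop_rate
    return patches * keep[..., None]
```

Because dropping acts on whole tokens rather than individual features, the transformer is forced to rely on the remaining patches' cross-frame consistency, which is the augmentation effect the abstract describes.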