In this paper, we introduce MINTIME, a video deepfake detection approach that captures spatial and temporal anomalies while handling videos that contain multiple people and faces of varying sizes. Previous approaches disregard such information, either by using simple a-posteriori aggregation schemes, i.e., average or max operations, or by using only one identity for inference, i.e., the largest one. In contrast, the proposed approach builds on a Spatio-Temporal TimeSformer combined with a Convolutional Neural Network backbone to capture spatio-temporal anomalies from the face sequences of the multiple identities depicted in a video. This is achieved through an Identity-aware Attention mechanism that uses a masking operation to attend to each face sequence independently and that facilitates video-level aggregation. In addition, two novel embeddings are employed: (i) the Temporal Coherent Positional Embedding, which encodes the temporal information of each face sequence, and (ii) the Size Embedding, which encodes the size of each face as a ratio to the video frame size. These extensions allow our system to adapt particularly well to in-the-wild videos by learning how to aggregate information from multiple identities, which other methods in the literature usually disregard. MINTIME achieves state-of-the-art results on the ForgeryNet dataset, with an improvement of up to 14% AUC on videos containing multiple people, and demonstrates ample generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection.
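To make the masking idea concrete, the following is a minimal sketch of how an identity-based attention mask and a face-to-frame size ratio could be computed. It is not taken from the released code. The function names, the `CLS` marker, the rule that a global token attends to every token, and the use of bounding-box area for the ratio are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math

CLS = -1  # hypothetical marker for a global, video-level token (assumption)


def identity_mask(identity_ids):
    """mask[i][j] is True when token i may attend to token j.

    Assumed rule: a face token attends only to tokens of the same identity,
    while the CLS token attends to every token.
    """
    return [[a == CLS or a == b for b in identity_ids] for a in identity_ids]


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention; masked-out pairs receive zero weight."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            dot = sum(a * b for a, b in zip(q, k))
            scores.append(dot / scale if mask[i][j] else -math.inf)
        peak = max(scores)  # finite, because every token may attend to itself
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def size_ratio(face_w, face_h, frame_w, frame_h):
    """Face size as a fraction of the frame, using box area (assumption)."""
    return (face_w * face_h) / (frame_w * frame_h)


# Toy input: one CLS token followed by two tokens for each of two identities.
ids = [CLS, 0, 0, 1, 1]
mask = identity_mask(ids)
flat = [[0.0, 0.0] for _ in ids]  # identical queries/keys give uniform weights
vals = [[0.0], [1.0], [1.0], [5.0], [5.0]]
out = masked_attention(flat, flat, vals, mask)
```

In this toy input, the output for each face token mixes only the values of its own identity, whereas the output for the CLS token mixes the values of all tokens, which is one way such a mask could support video-level aggregation.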