With the rapid development of face forgery technology, deepfake videos have attracted widespread attention in digital media. Perpetrators use these videos extensively to spread disinformation and make misleading statements. Most existing deepfake detection methods focus on texture features, which are sensitive to external variations such as illumination and noise. In contrast, detection methods based on facial landmarks are more robust to such variations but lack sufficient detail. How to effectively mine distinctive features in the spatial, temporal, and frequency domains and fuse them with facial landmarks for forged video detection therefore remains an open question. To this end, we propose a Landmark Enhanced Multimodal Graph Neural Network (LEM-GNN) that combines information from multiple modalities with the geometric features of facial landmarks. Specifically, at the frame level, we design a fusion mechanism that mines a joint representation of the spatial and frequency domains while introducing geometric facial features to enhance the robustness of the model. At the video level, we regard each frame in a video as a node in a graph and encode temporal information into the edges of the graph. Then, by applying the message passing mechanism of the graph neural network (GNN), the multimodal features are combined to obtain a comprehensive representation of the video forgery. Extensive experiments show that our method consistently outperforms state-of-the-art (SOTA) methods on widely used benchmarks.
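The video-level idea described in the abstract (frames as graph nodes, temporal information on the edges, message passing to combine per-frame features) can be illustrated with a minimal sketch. This is not the LEM-GNN architecture: the learned GNN layers are replaced by fixed weighted averaging, and the function names, the exponential edge weighting, and the `window` and `tau` parameters are all hypothetical choices made only for illustration.

```python
import math


def temporal_edge_weight(i, j, tau=2.0):
    """Hypothetical edge weight that decays with the frame distance |i - j|."""
    return math.exp(-abs(i - j) / tau)


def build_frame_graph(num_frames, window=2, tau=2.0):
    """Return a row-normalized adjacency matrix over frames.

    Each frame is a node. Frames at most `window` steps apart are connected
    (self-loops included), and the edge weight encodes temporal distance.
    """
    adj = []
    for i in range(num_frames):
        row = [
            temporal_edge_weight(i, j, tau) if abs(i - j) <= window else 0.0
            for j in range(num_frames)
        ]
        total = sum(row)  # always > 0 because of the self-loop
        adj.append([w / total for w in row])
    return adj


def message_passing(features, adj, steps=1):
    """Replace each node feature with the weighted mean of its neighbours.

    Assumption: a fixed averaging rule stands in for learned GNN layers.
    """
    for _ in range(steps):
        features = [
            [
                sum(adj[i][j] * features[j][d] for j in range(len(features)))
                for d in range(len(features[0]))
            ]
            for i in range(len(features))
        ]
    return features


def video_representation(features):
    """Mean-pool node features into a single video-level vector."""
    n = len(features)
    return [sum(f[d] for f in features) / n for d in range(len(features[0]))]


# Toy per-frame features; in practice these would be the fused
# spatial, frequency, and landmark features of each frame.
frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
adj = build_frame_graph(len(frame_features), window=1)
updated = message_passing(frame_features, adj, steps=2)
video_vec = video_representation(updated)
```

Row normalization keeps feature magnitudes stable across message passing steps; a trained model would typically learn the edge weights and the node update function rather than fixing them as done here.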