Recent advances in face forgery techniques produce nearly visually untraceable deepfake videos, which could be leveraged with malicious intent. As a result, researchers have devoted considerable effort to deepfake detection. Previous studies have identified the importance of local low-level cues and temporal information for generalizing well across deepfake methods; however, they still suffer from a lack of robustness against post-processing operations. In this work, we propose the Local- & Temporal-aware Transformer-based Deepfake Detection (LTTD) framework, which adopts a local-to-global learning protocol with a particular focus on the valuable temporal information within local sequences. Specifically, we propose a Local Sequence Transformer (LST), which models temporal consistency on sequences of restricted spatial regions, where low-level information is hierarchically enhanced with shallow layers of learned 3D filters. Based on the local temporal embeddings, we then achieve the final classification in a global contrastive way. Extensive experiments on popular datasets validate that our approach effectively spots local forgery cues and achieves state-of-the-art performance.
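The local-to-global idea described above can be illustrated with a toy sketch (not the authors' implementation, which uses learned 3D filters and a transformer): split each frame into fixed spatial patches, track each patch across time as a local sequence, and score each sequence's temporal consistency. All function names and the threshold below are illustrative assumptions.

```python
import numpy as np

def extract_local_sequences(video, patch=8):
    # video: (T, H, W) grayscale frames.
    # Returns an array of shape (num_patches, T, patch*patch), one temporal
    # sequence per restricted spatial region.
    T, H, W = video.shape
    seqs = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            seqs.append(video[:, y:y + patch, x:x + patch].reshape(T, -1))
    return np.stack(seqs)

def temporal_consistency(seq):
    # Mean cosine similarity between consecutive frames of one local sequence;
    # a crude stand-in for the learned temporal embeddings in the paper.
    a, b = seq[:-1], seq[1:]
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    return float((num / den).mean())

def detect(video, patch=8, threshold=0.9):
    # Flag the video as suspicious if ANY local patch sequence is temporally
    # inconsistent (threshold chosen arbitrarily for illustration).
    scores = [temporal_consistency(s)
              for s in extract_local_sequences(video, patch)]
    return min(scores) < threshold
```

A perfectly static clip yields consistency 1.0 for every patch and is not flagged, while a clip whose content flickers inside a single local region is flagged, reflecting how forgery artifacts tend to surface as localized temporal inconsistencies.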