Face presentation attack detection (PAD) plays an important role in defending face recognition systems against presentation attacks. The success of PAD largely relies on supervised learning, which requires a large amount of labeled data; labeling is especially challenging for videos and often requires expert knowledge. To avoid the costly collection of labeled data, this paper presents a novel method for self-supervised video representation learning via motion prediction. To achieve this, we exploit temporal consistency based on three RGB frames acquired at three different times in the video sequence. The obtained frames are transformed into grayscale images, and each grayscale image is assigned to one of the three channels, R (red), G (green), and B (blue), to form a dynamic grayscale snippet (DGS). Based on the DGS, labels are automatically generated to increase temporal diversity by using different temporal lengths of the videos, which proves to be very helpful for the downstream task. Benefiting from the self-supervised nature of our method, we report results that outperform existing methods on four public benchmark datasets, namely Replay-Attack, MSU-MFSD, CASIA-FASD, and OULU-NPU. Explainability analysis has been carried out through LIME and Grad-CAM techniques to visualize the most important features used in the DGS.
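A minimal sketch of how a dynamic grayscale snippet (DGS) could be assembled and how pretext labels could be derived from different temporal lengths, assuming OpenCV-style video input; the function name build_dgs, the frame indices, and the stride values are illustrative assumptions, not the authors' implementation.

```python
import cv2
import numpy as np


def build_dgs(video_path, frame_indices=(0, 8, 16)):
    """Sample three frames at different times, convert each to grayscale,
    and stack them as the R, G, and B channels of a single image (DGS)."""
    cap = cv2.VideoCapture(video_path)
    grays = []
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            raise ValueError(f"Could not read frame {idx}")
        grays.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    # Each grayscale frame occupies one channel; motion between the sampled
    # frames then appears as color differences in the resulting DGS.
    return np.stack(grays, axis=-1)  # H x W x 3


# Hypothetical pretext-task labeling (an assumption based on the abstract):
# each temporal length (stride) between sampled frames gets its own class
# label, so the network learns to predict the temporal spacing.
strides = [2, 4, 8]  # assumed temporal lengths, in frames
samples = [(build_dgs("example_video.avi", (0, s, 2 * s)), label)
           for label, s in enumerate(strides)]
```

In this sketch the pseudo-labels come for free from how the frames are sampled, which is what makes the pretraining self-supervised; no manual annotation is involved.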