Face presentation attack detection (PAD) plays an important role in defending face recognition systems against presentation attacks. The success of PAD largely relies on supervised learning, which requires a large amount of labeled data; labeling is especially challenging for videos and often requires expert knowledge. To avoid the costly collection of labeled data, this paper presents a novel method for self-supervised video representation learning via motion prediction. To achieve this, we exploit temporal consistency based on three RGB frames acquired at three different time instants of a video sequence. The obtained frames are converted into grayscale images, and each image is assigned to one of three channels, R (red), G (green), and B (blue), to form a dynamic grayscale snippet (DGS). Based on the DGS, labels are automatically generated by varying the temporal length of the videos, which increases temporal diversity and proves very helpful for the downstream task. Benefiting from the self-supervised nature of our method, we report results that outperform existing methods on four public benchmarks, namely Replay-Attack, MSU-MFSD, CASIA-FASD, and OULU-NPU. An explainability analysis is carried out using LIME and Grad-CAM to visualize the most important features of the DGS.
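For intuition, the following is a minimal illustrative sketch (not the authors' code) of how a dynamic grayscale snippet could be assembled: three frames sampled at different times are converted to grayscale and stacked as the R, G, and B channels of a single image. The starting index and sampling stride used here are hypothetical placeholders.

```python
import cv2
import numpy as np

def build_dgs(video_path, t0=0, stride=5):
    """Return a 3-channel image whose channels are grayscale frames
    taken at times t0, t0 + stride, and t0 + 2 * stride."""
    cap = cv2.VideoCapture(video_path)
    gray_frames = []
    for t in (t0, t0 + stride, t0 + 2 * stride):
        cap.set(cv2.CAP_PROP_POS_FRAMES, t)
        ok, frame = cap.read()
        if not ok:
            raise ValueError(f"Could not read frame {t}")
        # Convert the RGB (BGR in OpenCV) frame to a single grayscale plane.
        gray_frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    # Stack the three grayscale frames along the channel axis -> (H, W, 3).
    return np.stack(gray_frames, axis=-1)
```

Varying the stride (i.e., the temporal length spanned by the three frames) would then yield different DGS variants, in the spirit of the automatically generated labels described above.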