This paper proposes a self-supervised approach to learning universal facial representations from videos that transfer across a variety of facial analysis tasks, such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available, non-annotated, web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin, in order to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate that MARLIN is an excellent facial video encoder and feature extractor that performs consistently well across tasks including FAR (1.13% gain over the supervised benchmark), FER (2.64% gain over the unsupervised benchmark), DFD (1.86% gain over the unsupervised benchmark), and LS (29.36% gain in Fréchet Inception Distance), and even in the low-data regime. Our code and pre-trained models will be made public.
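The masked-reconstruction objective described in the abstract can be illustrated with a minimal sketch. This is not the MARLIN implementation: the abstract says masking targets facial regions, whereas this sketch substitutes plain uniform random masking for simplicity. The clip size (16 frames at 224×224), the 2×16×16 token size, the 0.9 mask ratio, and the function names `patch_grid`, `sample_mask`, and `masked_mse` are all assumptions chosen for illustration.

```python
import random


def patch_grid(frames, height, width, tube=2, patch=16):
    """Return the (time, rows, cols) size of the token grid for a clip.

    Each token covers `tube` frames and a `patch` x `patch` pixel area.
    All sizes here are illustrative assumptions, not values from the paper.
    """
    if frames % tube or height % patch or width % patch:
        raise ValueError("clip dimensions must be divisible by the token size")
    return frames // tube, height // patch, width // patch


def sample_mask(num_tokens, mask_ratio=0.9, seed=None):
    """Split token indices into (visible, masked) lists.

    Uniform random masking is a stand-in assumption; the abstract
    describes masking guided by facial regions instead.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_tokens * mask_ratio))
    masked = sorted(rng.sample(range(num_tokens), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_tokens) if i not in masked_set]
    return visible, masked


def masked_mse(target, prediction, masked):
    """Mean squared error computed only over the masked tokens."""
    total, count = 0.0, 0
    for i in masked:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


# The encoder would see only the `visible` tokens; the decoder would be
# trained to reconstruct the pixel content of the `masked` tokens.
grid = patch_grid(frames=16, height=224, width=224)
num_tokens = grid[0] * grid[1] * grid[2]
visible, masked = sample_mask(num_tokens, mask_ratio=0.9, seed=0)
```

With a high mask ratio, the encoder processes only a small fraction of the tokens, and the loss is evaluated only where content was hidden, which is what makes reconstruction a demanding auxiliary task.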