This paper proposes a self-supervised approach to learning universal facial representations from videos that transfer across a variety of facial analysis tasks, such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available, unannotated, web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin. This forces the model to capture both local and global aspects of the face, which in turn helps it encode generic and transferable features. Through experiments on diverse downstream tasks, we demonstrate that MARLIN is an excellent facial video encoder and feature extractor that performs consistently well on FAR (1.13% gain over the supervised benchmark), FER (2.64% gain over the unsupervised benchmark), DFD (1.86% gain over the unsupervised benchmark), and LS (29.36% gain in Fréchet Inception Distance), and that it remains effective in low-data regimes. Our code and models are available at https://github.com/ControlNet/MARLIN .
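The masked reconstruction task described in the abstract can be illustrated with a minimal sketch of how a dense mask over video patches might be drawn. This is not the MARLIN implementation: the abstract does not state the masking ratio, the patch grid, or the sampling rule, so the 90% ratio, the 14x14 grid, the "tube" layout (the same patches hidden in every frame), and the `priority` set of patch indices standing in for eye, nose, and mouth regions are all assumptions made for illustration. The function name `sample_tube_mask` is likewise hypothetical.

```python
import random


def sample_tube_mask(grid_h, grid_w, num_frames, mask_ratio, priority=None, seed=None):
    """Return a list of per-frame boolean lists; True means the patch is hidden.

    Assumption: the same spatial patches are hidden in every frame (a "tube"),
    so a hidden patch cannot simply be copied from a neighbouring frame.
    Assumption: `priority` holds patch indices covering facial components
    (hypothetical input); these are hidden before any other patch.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")

    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = round(mask_ratio * num_patches)

    # Keep only valid, unique priority indices, then list the remaining patches.
    priority_set = {i for i in (priority or ()) if 0 <= i < num_patches}
    first = sorted(priority_set)
    rest = [i for i in range(num_patches) if i not in priority_set]

    # Shuffle each group separately so priority patches always come first.
    rng.shuffle(first)
    rng.shuffle(rest)
    chosen = set((first + rest)[:num_masked])

    frame_mask = [i in chosen for i in range(num_patches)]
    # Repeat the same spatial mask for every frame.
    return [list(frame_mask) for _ in range(num_frames)]


if __name__ == "__main__":
    mask = sample_tube_mask(14, 14, num_frames=8, mask_ratio=0.9,
                            priority={0, 1, 2}, seed=0)
    visible = [i for i, hidden in enumerate(mask[0]) if not hidden]
    print(f"{len(visible)} of {len(mask[0])} patches stay visible per frame")
```

In a masked autoencoder, only the visible patches would be passed to the encoder, and the decoder would be trained to reconstruct the hidden ones. Hiding the facial-component patches first is one plausible way to make the reconstruction task harder than uniform random masking.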