We propose a self-supervised contrastive learning approach for facial expression recognition (FER) in videos. In addition to the standard spatial augmentations used in contrastive learning, we introduce a novel temporal sampling-based augmentation scheme that randomly selects one of three temporal sampling techniques: (1) pure random sampling, (2) uniform sampling, and (3) sequential sampling. This is followed by a combination of up to three standard spatial augmentations. We then train a deep R(2+1)D network for FER in a self-supervised fashion using these augmentations and subsequently fine-tune it. Experiments are performed on the Oulu-CASIA dataset and the performance is compared against other works in FER. Our method achieves an accuracy of 89.4%, outperforming prior work and setting a new state of the art. Additional experiments and analysis confirm the considerable contribution of the proposed temporal augmentations over the existing spatial ones.
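The three temporal sampling strategies named above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the clip length, and the choice of Python's standard `random` module are all assumptions made for clarity.

```python
import random

def random_sampling(num_frames, clip_len, rng):
    """Pure random sampling: clip_len distinct frame indices drawn
    anywhere in the video, returned in temporal order."""
    return sorted(rng.sample(range(num_frames), clip_len))

def uniform_sampling(num_frames, clip_len):
    """Uniform sampling: clip_len evenly spaced indices spanning the video."""
    step = num_frames / clip_len
    return [int(i * step) for i in range(clip_len)]

def sequential_sampling(num_frames, clip_len, rng):
    """Sequential sampling: clip_len consecutive frames from a random start."""
    start = rng.randrange(num_frames - clip_len + 1)
    return list(range(start, start + clip_len))

def temporal_augment(num_frames, clip_len, rng=None):
    """Randomly pick one of the three sampling techniques and return
    the selected frame indices for one augmented view of the video."""
    rng = rng or random.Random()
    technique = rng.choice(["random", "uniform", "sequential"])
    if technique == "random":
        return random_sampling(num_frames, clip_len, rng)
    if technique == "uniform":
        return uniform_sampling(num_frames, clip_len)
    return sequential_sampling(num_frames, clip_len, rng)
```

In a contrastive setup, two calls to `temporal_augment` on the same video would yield two temporally distinct clips, to which the spatial augmentations are then applied before the pair is treated as a positive example.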