A micro-expression is a spontaneous, unconscious facial muscle movement that can reveal the true emotions people attempt to hide. Although hand-crafted methods have made good progress and deep learning is gaining prominence, the short duration of micro-expressions and the different scales at which they appear across facial regions mean that existing algorithms cannot extract multi-modal, multi-scale facial region features while also exploiting contextual information to learn the underlying representations. To address these problems, this paper proposes a multi-modal, multi-scale algorithm based on a Transformer network, which aims to fully learn the local multi-grained features of micro-expressions from two modalities: motion features and texture features. To obtain local facial features at different scales, we learn patch features at multiple scales for both modalities, fuse multi-layer multi-head attention weights to obtain effective features by weighting the patch features, and combine this with cross-modal contrastive learning for model optimization. We conducted comprehensive experiments on three spontaneous datasets; the results show that the proposed algorithm reaches an accuracy of up to 78.73% on the SMIC database in the single-database evaluation and an F1 score of up to 0.9071 on the CASME II portion of the composite database, which is at the leading level.
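The two core ideas in the abstract, fusing multi-layer multi-head attention weights to weight patch features and optimizing with a cross-modal contrastive objective, can be sketched as follows. This is a minimal NumPy illustration under assumed shapes (16 patches of dimension 32, attention from 4 layers and 8 heads, a toy batch of 4 samples); the function names, the averaging scheme for fusing attention weights, and the InfoNCE-style loss are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_weighted_patches(patches, attn_weights):
    # attn_weights: (layers, heads, patches). Fuse by averaging over layers
    # and heads, then softmax to get one importance score per patch.
    # (The actual fusion rule in the paper may differ -- this is a sketch.)
    scores = softmax(attn_weights.mean(axis=(0, 1)))
    # Weighted pooling of patch features -> a single (dim,) embedding.
    return (patches * scores[:, None]).sum(axis=0)

def cross_modal_contrastive_loss(z_a, z_b, temperature=0.1):
    # InfoNCE-style loss between the two modality embeddings: matched
    # motion/texture pairs sit on the diagonal of the similarity matrix
    # and are pulled together relative to mismatched pairs.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    log_probs = np.log(softmax(logits, axis=1))
    return -np.mean(np.diag(log_probs))

# Toy batch: random patch features and attention weights per modality.
z_motion = np.stack([
    attention_weighted_patches(rng.normal(size=(16, 32)),
                               rng.normal(size=(4, 8, 16)))
    for _ in range(4)
])
z_texture = np.stack([
    attention_weighted_patches(rng.normal(size=(16, 32)),
                               rng.normal(size=(4, 8, 16)))
    for _ in range(4)
])
loss = cross_modal_contrastive_loss(z_motion, z_texture)
print(z_motion.shape, float(loss))
```

In a real model, the patch features would come from multi-scale patch embeddings of the optical-flow (motion) and appearance (texture) inputs, and the attention weights from the Transformer's own attention maps rather than random arrays.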