Being spontaneous, micro-expressions are useful for inferring a person's true emotions even when an attempt is made to conceal them. Owing to their short duration and low intensity, recognizing micro-expressions is a difficult task in affective computing. Early work based on handcrafted spatio-temporal features, which showed some promise, has recently been superseded by various deep learning approaches, which now compete for state-of-the-art performance. Nevertheless, capturing both local and global spatio-temporal patterns remains challenging. To this end, we propose a novel spatio-temporal transformer architecture -- to the best of our knowledge, the first purely transformer-based approach (i.e. devoid of any convolutional network use) for micro-expression recognition. The architecture comprises a spatial encoder, which learns spatial patterns, a temporal aggregator for analysis along the temporal dimension, and a classification head. A comprehensive evaluation on three widely used spontaneous micro-expression data sets, namely SMIC-HS, CASME II and SAMM, shows that the proposed approach consistently outperforms the state of the art, and is the first framework in the published literature on micro-expression recognition to achieve an unweighted F1-score greater than 0.9 on any of the aforementioned data sets.
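For readers unfamiliar with the headline metric, the unweighted F1-score is commonly understood as the macro-averaged F1: a per-class F1 is computed and the classes are then averaged with equal weight, so that rare emotion classes count as much as frequent ones. The following is a minimal sketch under that assumption; the function name `unweighted_f1` and the toy labels are illustrative and are not taken from the paper. In cross-validation protocols the per-class counts are typically pooled over all folds before the per-class F1 is computed, which this sketch does not show.

```python
from collections import Counter


def unweighted_f1(y_true, y_pred):
    """Macro-averaged F1 over the classes present in y_true (assumed definition)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the true class but was missed

    # Per-class F1 = 2*TP / (2*TP + FP + FN); every class in y_true has
    # TP + FN >= 1, so the denominator is never zero.
    per_class = [
        2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in sorted(set(y_true))
    ]
    # Plain mean: each class contributes equally, regardless of its size.
    return sum(per_class) / len(per_class)


# Toy example with made-up labels (three classes, five samples).
truth = ["negative", "negative", "negative", "positive", "surprise"]
preds = ["negative", "negative", "positive", "positive", "negative"]
score = unweighted_f1(truth, preds)
```

In this toy example the per-class scores are 2/3 (negative), 2/3 (positive) and 0 (surprise), so a single missed minority class pulls the average down to 4/9 even though three of the five samples are classified correctly.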