Recognising continuous emotions and action unit (AU) intensities from face videos requires a spatial and temporal understanding of expression dynamics. Existing works primarily rely on 2D face appearance to extract such dynamics. This work focuses on a promising alternative based on parametric 3D face shape alignment models, which disentangle different factors of variation, including expression-induced shape variations. We aim to understand how expressive 3D face shapes are in estimating valence-arousal and AU intensities compared with state-of-the-art 2D appearance-based models. We benchmark four recent 3D face alignment models: ExpNet, 3DDFA-V2, DECA, and EMOCA. In valence-arousal estimation, expression features of 3D face models consistently surpassed previous works and yielded average concordance correlation coefficients of .739 and .574 on the SEWA and AVEC 2019 CES corpora, respectively. We also study how 3D face shapes perform on AU intensity estimation on the BP4D and DISFA datasets, and report that 3D face features were on par with 2D appearance features for AUs 4, 6, 10, 12, and 25, but not for the entire set of AUs. To understand this discrepancy, we conduct a correspondence analysis between valence-arousal and AUs, which suggests that accurate prediction of valence-arousal may require knowledge of only a few AUs.
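For reference, the concordance correlation coefficient (CCC) reported above follows the standard definition. Below is a minimal NumPy sketch of that definition; the function name `ccc` and the toy data are ours for illustration, and this is not code from any of the benchmarked models.

```python
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance correlation coefficient between two 1-D series.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population (ddof=0) variance and covariance.
    """
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()
    cov = ((y_true - mean_true) * (y_pred - mean_pred)).mean()
    return 2 * cov / (var_true + var_pred + (mean_true - mean_pred) ** 2)

# Hypothetical usage: CCC on a single sequence of valence predictions.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    valence_gt = rng.uniform(-1, 1, size=500)            # ground-truth valence
    valence_hat = valence_gt + rng.normal(0, 0.2, 500)   # noisy predictions
    print(f"CCC = {ccc(valence_gt, valence_hat):.3f}")
```

Unlike Pearson correlation, CCC penalises both scale and location shifts between predictions and ground truth, which is why it is the standard metric on the SEWA and AVEC corpora.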