Vision transformers (ViT) have been widely applied in many areas because their self-attention mechanism provides a global receptive field from the first layer, and they even surpass CNNs on some vision tasks. However, applying vision transformers to 2D+3D facial expression recognition (FER) raises an issue: ViT training requires massive data, yet the number of samples in public 2D+3D FER datasets is far from sufficient. How to leverage a ViT pre-trained on RGB images to handle 2D+3D data therefore becomes a challenge. To solve this problem, we propose MFEViT, a robust, lightweight, pure-transformer-based network for multimodal 2D+3D FER. To narrow the gap between RGB and multimodal data, we devise an alternative fusion strategy that replaces each of the three channels of an RGB image with the depth-map channel and fuses the resulting images before feeding them into the transformer encoder. Moreover, the designed sample-filtering module adds several subclasses for each expression and moves noisy samples into their corresponding subclasses, eliminating their disturbance on the network during training. Extensive experiments demonstrate that our MFEViT outperforms state-of-the-art approaches with an accuracy of 90.83% on BU-3DFE and 90.28% on Bosphorus. Moreover, the proposed MFEViT is a lightweight model, requiring far fewer parameters than multi-branch CNNs. To the best of our knowledge, this is the first work to introduce vision transformers into multimodal 2D+3D FER. The source code of our MFEViT will be publicly available online.
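The alternative fusion strategy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `alternative_fusion` and the choice of element-wise mean as the fusion operation are assumptions; the paper only states that each RGB channel is replaced by the depth-map channel and the results are fused before the transformer encoder.

```python
import numpy as np

def alternative_fusion(rgb, depth):
    """Sketch of the channel-replacement fusion for 2D+3D input.

    rgb:   array of shape (3, H, W) -- the color image
    depth: array of shape (H, W)    -- the aligned depth map
    Returns a fused (3, H, W) array suitable for an RGB-pretrained ViT.
    """
    variants = []
    for c in range(3):
        img = rgb.copy()
        img[c] = depth  # replace one color channel with the depth map
        variants.append(img)
    # Fuse the three variants; element-wise mean is an assumed choice here.
    return np.mean(np.stack(variants), axis=0)
```

Because the output keeps the (3, H, W) layout of an ordinary RGB image, a ViT pre-trained on RGB data can consume it without any architectural change, which is the point of narrowing the modality gap.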