Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships among, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer for fusion with key-based cross-attention. The framework exploits the diverse and complementary nature of the modalities to improve predictive accuracy. Separate backbones capture the intra-modal spatiotemporal dependencies within each modality over video sequences. A joint multimodal transformer fusion architecture then integrates the individual modality embeddings, allowing the model to capture both inter-modal and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks, (1) dimensional emotion recognition on the Aff-Wild2 dataset (with face and voice) and (2) pain estimation on the BioVid dataset (with face and biosensors), indicate that the proposed method works effectively with different modalities. Empirical results show that MMER systems using the proposed fusion method outperform relevant baseline and state-of-the-art methods.