Multimodal Emotion Recognition in Conversation (MERC) significantly improves emotion recognition performance by integrating complementary emotional cues from text, audio, and visual modalities. While existing methods commonly employ techniques such as contrastive learning and cross-attention to align cross-modal emotional semantics, they typically overlook modality-specific emotional nuances such as micro-expressions, tone variations, and sarcastic language. To overcome these limitations, we propose Orthogonal Disentanglement with Projected Feature Alignment (OD-PFA), a novel framework explicitly designed to capture both shared semantics and modality-specific emotional cues. Our approach first decouples unimodal features into shared and modality-specific components. An orthogonal disentanglement (OD) strategy enforces effective separation between these components, aided by a reconstruction loss that preserves critical emotional information from each modality. A projected feature alignment (PFA) strategy then maps the shared features of all modalities into a common latent space and applies a cross-modal consistency alignment loss to enhance semantic coherence. Extensive evaluations on the widely used benchmark datasets IEMOCAP and MELD demonstrate the effectiveness of the proposed OD-PFA on multimodal emotion recognition tasks compared with state-of-the-art approaches.
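To make the disentanglement and alignment objectives more concrete, the following is a minimal PyTorch sketch of the three losses described above. It assumes linear encoders, decoders, and projectors, a squared-cosine orthogonality penalty, an MSE reconstruction term, and a pairwise cosine alignment term over the projected shared features; all module names, shapes, and loss forms here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Dict


class ODPFASketch(nn.Module):
    """Hedged sketch of the OD-PFA objectives: orthogonal disentanglement,
    reconstruction, and projected cross-modal alignment (assumed forms)."""

    MODALITIES = ("text", "audio", "visual")

    def __init__(self, dim: int = 256, latent_dim: int = 128):
        super().__init__()
        # Per-modality encoders splitting a unimodal feature into a shared
        # component and a modality-specific component (assumed linear).
        self.shared_enc = nn.ModuleDict({m: nn.Linear(dim, dim) for m in self.MODALITIES})
        self.specific_enc = nn.ModuleDict({m: nn.Linear(dim, dim) for m in self.MODALITIES})
        # Decoder used by the reconstruction loss to retain emotional content.
        self.decoder = nn.ModuleDict({m: nn.Linear(2 * dim, dim) for m in self.MODALITIES})
        # Projection head mapping shared features into a common latent space.
        self.projector = nn.Linear(dim, latent_dim)

    def forward(self, feats: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        losses = {"orth": 0.0, "recon": 0.0, "align": 0.0}
        projected = {}
        for m, x in feats.items():
            shared = self.shared_enc[m](x)
            specific = self.specific_enc[m](x)
            # Orthogonal disentanglement: penalize overlap between shared and
            # modality-specific components (squared cosine, one common choice).
            losses["orth"] = losses["orth"] + F.cosine_similarity(
                shared, specific, dim=-1
            ).pow(2).mean()
            # Reconstruction loss keeps critical emotional information.
            recon = self.decoder[m](torch.cat([shared, specific], dim=-1))
            losses["recon"] = losses["recon"] + F.mse_loss(recon, x)
            # Project shared features into the common latent space.
            projected[m] = F.normalize(self.projector(shared), dim=-1)
        # Cross-modal consistency alignment over pairs of projected features.
        pairs = [("text", "audio"), ("text", "visual"), ("audio", "visual")]
        for a, b in pairs:
            losses["align"] = losses["align"] + (
                1.0 - F.cosine_similarity(projected[a], projected[b], dim=-1)
            ).mean()
        return losses


# Usage sketch: batch of per-modality utterance features (shapes assumed).
model = ODPFASketch(dim=256, latent_dim=128)
batch = {m: torch.randn(8, 256) for m in ODPFASketch.MODALITIES}
losses = model(batch)
total = losses["orth"] + losses["recon"] + losses["align"]  # weights omitted
```

In practice the three terms would be weighted and combined with a classification loss; the equal weighting shown here is only for illustration.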