Applications involving human interaction underscore the need for Multi-modal Sentiment Analysis (MSA). Although many approaches have been proposed to capture the subtle emotions expressed across modalities, the power of explanations and temporal alignment remains underexplored. This paper therefore proposes the Text-routed sparse mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments MSA with explanations generated by Multi-modal Large Language Models (MLLMs), and then aligns the representations of audio and video through a novel temporality-oriented neural network block. TEXT aligns the different modalities with the explanations and feeds them into a new text-routed sparse mixture-of-experts with gate fusion. Our temporal alignment block combines the benefits of Mamba and temporal cross-attention. As a result, TEXT achieves the best performance across four datasets among all tested models, including three recently proposed approaches and three MLLMs, winning on at least four of the six metrics. For example, TEXT reduces the mean absolute error on the CH-SIMS dataset to 0.353, a 13.5% reduction compared with recently proposed approaches.