Humans are skilled at reading an interlocutor's emotion from multimodal signals, including spoken words, simultaneous speech, and facial expressions. However, it remains challenging to effectively decode emotions from the complex interactions of multimodal signals. In this paper, we design three kinds of multimodal latent representations to refine the emotion analysis process and capture complex multimodal interactions from different views: an intact three-modal integrating representation, a modality-shared representation, and three modality-individual representations. We then propose a modality-semantic hierarchical fusion to reasonably incorporate these representations into a comprehensive interaction representation. The experimental results demonstrate that our EffMulti outperforms state-of-the-art methods. This compelling performance stems from its well-designed framework, which offers ease of implementation, lower computational complexity, and fewer trainable parameters.
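To make the three kinds of latent representations and the hierarchical fusion concrete, the following is a minimal PyTorch sketch. All module names, feature dimensions, class count, and the exact fusion order are illustrative assumptions for exposition only, not the paper's released implementation.

```python
# A hedged sketch of the three latent representations (intact, shared, individual)
# and a modality-semantic hierarchical fusion, under assumed dimensions and layers.
import torch
import torch.nn as nn


class EffMultiSketch(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_vision=35, d_model=128, n_classes=7):
        super().__init__()
        # Project each modality's utterance-level feature into a common space.
        self.proj = nn.ModuleDict({
            "t": nn.Linear(d_text, d_model),
            "a": nn.Linear(d_audio, d_model),
            "v": nn.Linear(d_vision, d_model),
        })
        # (1) Intact three-modal integrating representation:
        #     encode the concatenation of all three modalities jointly.
        self.integrate = nn.Sequential(nn.Linear(3 * d_model, d_model), nn.ReLU())
        # (2) Modality-shared representation:
        #     one encoder applied to every modality, capturing common information.
        self.shared = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        # (3) Modality-individual representations:
        #     a private encoder per modality, capturing modality-specific cues.
        self.private = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU()) for m in "tav"
        })
        # Modality-semantic hierarchical fusion (assumed order): merge the
        # individual representations first, then the shared one, then the intact one.
        self.fuse_individual = nn.Linear(3 * d_model, d_model)
        self.fuse_shared = nn.Linear(2 * d_model, d_model)
        self.fuse_intact = nn.Linear(2 * d_model, d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text, audio, vision):
        x = {"t": self.proj["t"](text),
             "a": self.proj["a"](audio),
             "v": self.proj["v"](vision)}
        intact = self.integrate(torch.cat([x["t"], x["a"], x["v"]], dim=-1))
        shared = torch.stack([self.shared(x[m]) for m in "tav"]).mean(dim=0)
        individual = torch.cat([self.private[m](x[m]) for m in "tav"], dim=-1)
        # Fuse from modality-specific semantics up to the holistic representation.
        h = self.fuse_individual(individual)
        h = self.fuse_shared(torch.cat([h, shared], dim=-1))
        h = self.fuse_intact(torch.cat([h, intact], dim=-1))
        return self.classifier(h)


# Example with a batch of four utterance-level feature vectors.
logits = EffMultiSketch()(torch.randn(4, 768), torch.randn(4, 74), torch.randn(4, 35))
print(logits.shape)  # torch.Size([4, 7])
```

The sketch keeps every component to a single linear layer, which reflects the abstract's emphasis on low computational complexity and few trainable parameters; the actual encoders and fusion operators in EffMulti may differ.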