Human beings have rich ways of expressing emotion, including facial actions, voice, and natural language. Owing to the diversity and complexity of individuals, the emotions expressed through different modalities may be semantically irrelevant to one another. Directly fusing information from different modalities therefore inevitably exposes the model to noise from semantically irrelevant modalities. To tackle this problem, we propose a multimodal relevance estimation network that captures the relevant semantics among modalities in multimodal emotions. Specifically, we employ an attention mechanism to reflect the semantic relevance weight of each modality. Moreover, we propose a relevant semantic estimation loss to weakly supervise the semantics of each modality. Furthermore, we use contrastive learning to optimize the similarity of category-level modality-relevant semantics across modalities in feature space, thereby bridging the semantic gap between heterogeneous modalities. To better reflect emotional states in real interactive scenarios and to support semantic relevance analysis, we collect a single-label discrete multimodal emotion dataset, SDME, which enables researchers to study multimodal semantic relevance under large category bias. Experiments on continuous and discrete emotion datasets show that our model effectively captures the relevant semantics, especially when modality semantics deviate strongly. The code and the SDME dataset will be made publicly available.
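To make the two concrete mechanisms named above more tangible, the following is a minimal PyTorch sketch of (a) attention-derived per-modality relevance weights and (b) a cross-modal contrastive objective. It is illustrative only: all module names, dimensions, and loss forms are assumptions, and the contrastive term here is a generic instance-level InfoNCE rather than the paper's category-level formulation.

```python
# Illustrative sketch only: attention-based modality relevance weighting and a
# simple cross-modal contrastive loss. Names, dimensions, and loss forms are
# assumptions for exposition, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityRelevanceAttention(nn.Module):
    """Scores each modality's relevance and fuses features by those weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar relevance logit per modality

    def forward(self, feats: torch.Tensor):
        # feats: (batch, num_modalities, dim) -- e.g. face, voice, text features
        logits = self.score(feats).squeeze(-1)           # (batch, num_modalities)
        weights = torch.softmax(logits, dim=-1)          # relevance weights sum to 1
        fused = (weights.unsqueeze(-1) * feats).sum(1)   # weighted fusion, (batch, dim)
        return fused, weights


def cross_modal_contrastive_loss(za: torch.Tensor, zb: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss pulling paired cross-modal features together."""
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    sim = za @ zb.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(za.size(0), device=za.device)  # matched pairs on diagonal
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    face, voice, text = (torch.randn(8, 256) for _ in range(3))
    feats = torch.stack([face, voice, text], dim=1)      # (8, 3, 256)
    fused, weights = ModalityRelevanceAttention(256)(feats)
    loss = cross_modal_contrastive_loss(fused, text)     # align fused view with text
    print(fused.shape, weights.shape, loss.item())
```

In this toy setup, a modality whose semantics diverge from the others would receive a small softmax weight and thus contribute less noise to the fused representation, which is the intuition the abstract describes.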