Existing works on multimodal affective computing tasks, such as emotion recognition, generally adopt a two-phase pipeline: first extracting feature representations for each modality with hand-crafted algorithms, and then performing end-to-end learning on the extracted features. However, the extracted features are fixed and cannot be further fine-tuned on different target tasks, and manually designed feature extraction algorithms do not generalize or scale well across tasks, which can lead to sub-optimal performance. In this paper, we develop a fully end-to-end model that connects the two phases and optimizes them jointly. In addition, we restructure the current datasets to enable fully end-to-end training. Furthermore, to reduce the computational overhead introduced by the end-to-end model, we introduce a sparse cross-modal attention mechanism for feature extraction. Experimental results show that our fully end-to-end model significantly surpasses the current state-of-the-art models based on the two-phase pipeline. Moreover, by adding the sparse cross-modal attention, our model can maintain performance with around half the computation in the feature extraction part.
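To make the sparsification idea concrete, the following is a minimal NumPy sketch of one common form of sparse cross-modal attention, in which each query from one modality (e.g., text tokens) attends only to its top-k keys from another modality (e.g., visual patches). The function name, the top-k scheme, and all shapes here are illustrative assumptions, not the paper's exact mechanism:

```python
import numpy as np

def sparse_cross_modal_attention(queries, keys, values, k):
    """Illustrative sparse cross-modal attention: each query keeps only
    its top-k attention scores over the other modality's keys, so the
    softmax and the weighted sum involve k entries instead of all keys.
    (A sketch; the paper's actual sparsification scheme may differ.)"""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)              # (Tq, Tk)
    # Keep the top-k scores per query; mask out the rest with -inf
    topk_idx = np.argpartition(scores, -k, axis=-1)[:, -k:]
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, topk_idx, 0.0, axis=-1)
    sparse_scores = scores + mask
    # Softmax over the surviving (top-k) entries only
    e = np.exp(sparse_scores - sparse_scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ values                             # (Tq, d_v)

# Example: 4 text-token queries attending to 10 visual-patch key/value pairs
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
kv = rng.standard_normal((10, 8))
out = sparse_cross_modal_attention(q, kv, kv, k=3)
```

With k much smaller than the number of keys, the softmax and value aggregation touch only the selected entries, which is the source of the roughly halved feature-extraction computation reported above.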