In this paper, we propose a novel speech emotion recognition model called the Cross Attention Network (CAN), which takes aligned audio and text signals as inputs. It is inspired by the fact that humans perceive speech as a combination of simultaneously produced acoustic and textual signals. First, our method segments the audio signal and the underlying text into an equal number of steps in an aligned way, so that the same time step in each sequence covers the same time span. On top of this alignment, we apply cross attention to aggregate the sequential information from the two aligned signals. In the cross attention, each modality is first aggregated independently by applying a global attention mechanism to it. Then, the attention weights of each modality are applied directly to the other modality in a crossed way, so that CAN gathers audio and text information from the same time steps as weighted by each modality. In experiments on the standard IEMOCAP dataset, our model outperforms state-of-the-art systems by relative margins of 2.66% and 3.18% in weighted and unweighted accuracy, respectively.
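To make the crossed aggregation concrete, the following is a minimal PyTorch sketch of the idea as described above: each modality computes global attention weights over the aligned time steps, and each modality's weights are then used to pool the other modality. The module name, dimensions, and the two-layer attention scorer are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the crossed attention aggregation (illustrative, not the
# authors' released code). Assumes audio and text are already segmented into
# the same number T of aligned time steps.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionAggregator(nn.Module):
    def __init__(self, audio_dim=128, text_dim=300, hidden_dim=64):
        super().__init__()
        # Global-attention scorers, one per modality (assumed form).
        self.audio_score = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1))
        self.text_score = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1))

    def forward(self, audio, text):
        # audio: [B, T, audio_dim], text: [B, T, text_dim]
        a_weights = F.softmax(self.audio_score(audio), dim=1)  # [B, T, 1]
        t_weights = F.softmax(self.text_score(text), dim=1)    # [B, T, 1]
        # Crossed application: pool each modality with the *other*
        # modality's attention weights over the same time steps.
        audio_vec = (t_weights * audio).sum(dim=1)  # [B, audio_dim]
        text_vec = (a_weights * text).sum(dim=1)    # [B, text_dim]
        return torch.cat([audio_vec, text_vec], dim=-1)
```

The concatenated vector would then feed an emotion classifier; because the two sequences are aligned, swapping the weights lets each modality decide which time spans of the other modality matter.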