In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER). We propose a novel, multistage fusion method in which the two information streams are integrated in several layers of a deep neural network (DNN), and contrast it with a single-stage method in which the streams are merged at a single point. Both methods rely on extracting summary linguistic embeddings from a pre-trained BERT model and using them to condition one or more intermediate representations of a convolutional model operating on log-Mel spectrograms. Experiments on the MSP-Podcast and IEMOCAP datasets demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline and their unimodal constituents, both in terms of quantitative performance and qualitative behaviour. Overall, our multistage fusion shows better quantitative performance, surpassing the alternatives on most of our evaluations. This illustrates the potential of multistage fusion for better assimilating text and audio information.
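To make the multistage idea concrete, the sketch below shows one plausible way a summary BERT embedding could condition several intermediate layers of a convolutional network over log-Mel spectrograms. The layer sizes, the additive per-channel conditioning, the class count, and all names (e.g. MultistageFusionSER) are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class MultistageFusionSER(nn.Module):
    """Illustrative sketch of multistage text-audio fusion (not the paper's exact model):
    a CNN over log-Mel spectrograms whose feature maps are conditioned on a
    summary BERT embedding at every stage."""

    def __init__(self, bert_dim=768, channels=(32, 64, 128), n_classes=4):
        super().__init__()
        self.blocks = nn.ModuleList()
        self.conditioners = nn.ModuleList()
        in_ch = 1
        for ch in channels:
            self.blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(ch),
                nn.ReLU(),
                nn.MaxPool2d(2),
            ))
            # Project the BERT summary embedding to a per-channel bias that is
            # broadcast over the time-frequency map (one projection per stage).
            self.conditioners.append(nn.Linear(bert_dim, ch))
            in_ch = ch
        self.head = nn.Linear(channels[-1], n_classes)

    def forward(self, log_mel, bert_embedding):
        # log_mel: (batch, 1, n_mels, frames); bert_embedding: (batch, bert_dim)
        x = log_mel
        for block, cond in zip(self.blocks, self.conditioners):
            x = block(x)
            # Multistage fusion: inject linguistic information at each stage.
            x = x + cond(bert_embedding)[:, :, None, None]
        x = x.mean(dim=(2, 3))   # global average pooling over time and frequency
        return self.head(x)      # categorical emotion logits


# Usage: a batch of 8 utterances with 64 mel bands and 300 frames.
model = MultistageFusionSER()
logits = model(torch.randn(8, 1, 64, 300), torch.randn(8, 768))
```

A single-stage variant of this sketch would apply the conditioning only once (e.g. after the last convolutional block), which is the contrast the abstract draws.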