Models that can handle a wide range of speakers and acoustic conditions are essential for speech emotion recognition (SER). In practice, however, such models often show mixed results when presented with speakers or acoustic conditions that were not seen during training. This paper investigates the impact of cross-corpus data complementation and data augmentation on the performance of SER models in matched (test set from the same corpus) and mismatched (test set from a different corpus) conditions. Investigations using six emotional speech corpora are presented, covering single and multiple speakers as well as variations in emotion style (acted, elicited, natural) and recording conditions. The observations show that, as expected, models trained on a single corpus perform best in matched conditions, while performance decreases by 10-40% in mismatched conditions, depending on corpus-specific features. Models trained on mixed corpora can be more stable in mismatched conditions, with performance reductions of only 1-8% compared to single-corpus models in matched conditions. Data augmentation yields additional gains of up to 4% and seems to benefit mismatched conditions more than matched ones.
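The abstract does not specify which augmentation methods were applied; as a rough, hedged illustration, the sketch below shows two waveform-level augmentations commonly used in SER training pipelines (additive noise at a random SNR and speed perturbation) in plain NumPy. The function name `augment_waveform` and the parameter ranges are hypothetical, not taken from the paper.

```python
import numpy as np

def augment_waveform(wav: np.ndarray, rng: np.random.Generator | None = None) -> np.ndarray:
    """Illustrative waveform augmentation: additive noise plus speed perturbation.

    This is a sketch of generic augmentation, not the paper's actual pipeline.
    """
    rng = rng or np.random.default_rng()

    # Additive Gaussian noise at a random SNR between 10 and 30 dB.
    snr_db = rng.uniform(10.0, 30.0)
    signal_power = np.mean(wav ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noisy = wav + rng.normal(0.0, np.sqrt(noise_power), size=wav.shape)

    # Speed perturbation by resampling the time axis with a factor in [0.9, 1.1].
    factor = rng.uniform(0.9, 1.1)
    new_index = np.arange(0.0, len(noisy), factor)
    perturbed = np.interp(new_index, np.arange(len(noisy)), noisy)

    return perturbed.astype(np.float32)
```

In a cross-corpus setup such as the one described above, augmented copies of the training utterances would typically be pooled with the original (possibly multi-corpus) training data before fitting the SER model.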