Despite recent advances in speech emotion recognition (SER) within a single-corpus setting, the performance of SER systems degrades significantly in cross-corpus and cross-language scenarios. The key reason is the lack of generalisation of SER systems towards unseen conditions, which causes them to perform poorly in cross-corpus and cross-language settings. To address this issue, recent studies have utilised adversarial methods to learn domain-generalised representations that improve cross-corpus and cross-language SER. However, many of these methods focus only on cross-corpus SER and do not address cross-language SER, where performance degrades further due to the larger domain gap between source- and target-language data. This contribution proposes an adversarial dual discriminator (ADDi) network that uses a three-player adversarial game to learn generalised representations without requiring any target data labels. We also introduce a self-supervised ADDi (sADDi) network that utilises self-supervised pre-training with unlabelled data. We propose synthetic data generation as a pretext task in sADDi, enabling the network to produce emotionally discriminative and domain-invariant representations while providing complementary synthetic data to augment the system. The proposed model is rigorously evaluated using five publicly available datasets in three languages and compared with multiple studies on cross-corpus and cross-language SER. Experimental results demonstrate that the proposed model achieves improved performance compared to state-of-the-art methods.
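The abstract only names the three-player adversarial game at a high level; the sketch below is a minimal, hypothetical illustration of the general idea of an encoder trained against two domain discriminators while retaining emotion-discriminative structure. All module sizes, loss pairings, and the training schedule here are assumptions for illustration and are not the ADDi/sADDi architecture as specified in the paper.

```python
# Illustrative sketch only (not the authors' implementation): an encoder and two
# discriminators in a three-player adversarial game over source/target features.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps acoustic features to a shared representation (sizes are assumptions)."""
    def __init__(self, in_dim=128, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, z_dim))
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Predicts whether a representation comes from the source or target domain."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, z):
        return self.net(z)

class EmotionClassifier(nn.Module):
    """Emotion head over the shared representation (hypothetical)."""
    def __init__(self, z_dim=64, n_emotions=4):
        super().__init__()
        self.net = nn.Linear(z_dim, n_emotions)
    def forward(self, z):
        return self.net(z)

def adversarial_step(enc, d1, d2, clf, x_src, y_src, x_tgt, opt_enc, opt_disc):
    """One illustrative update: both discriminators learn to separate source from
    target, while the encoder tries to fool them and still classify emotion."""
    bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
    z_src, z_tgt = enc(x_src), enc(x_tgt)
    ones, zeros = torch.ones(len(x_src), 1), torch.zeros(len(x_tgt), 1)

    # 1) Update the two discriminators on detached representations.
    d_loss = (bce(d1(z_src.detach()), ones) + bce(d1(z_tgt.detach()), zeros) +
              bce(d2(z_src.detach()), ones) + bce(d2(z_tgt.detach()), zeros))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # 2) Update the encoder (and emotion head) to fool both discriminators on
    #    target data while keeping the representation emotionally discriminative.
    g_loss = (ce(clf(z_src), y_src) +
              bce(d1(z_tgt), torch.ones(len(x_tgt), 1)) +
              bce(d2(z_tgt), torch.ones(len(x_tgt), 1)))
    opt_enc.zero_grad(); g_loss.backward(); opt_enc.step()
    return d_loss.item(), g_loss.item()
```

Under these assumptions, `opt_enc` would optimise the encoder and emotion head jointly while `opt_disc` optimises both discriminators; the self-supervised pretext task (synthetic data generation) described in the abstract is not shown here.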