In this manuscript, multi-corpus Speech Emotion Recognition (SER) is approached from a deep transfer learning perspective. A large corpus of emotional speech data, EmoSet, is assembled from a number of existing SER corpora. In total, EmoSet contains 84181 audio recordings from 26 SER corpora with a total duration of over 65 hours. The corpus is then utilised to create a novel framework for multi-corpus speech emotion recognition, namely EmoNet. A combination of a deep ResNet architecture and residual adapters is transferred from the field of multi-domain visual recognition to multi-corpus SER on EmoSet. Compared against two suitable baselines and more traditional training and transfer settings for the ResNet, the residual adapter approach enables parameter-efficient training of a multi-domain SER model on all 26 corpora. A shared model with only $3.5$ times the number of parameters of a model trained on a single database leads to increased performance for 21 of the 26 corpora in EmoSet. Measured by McNemar's test, these improvements are statistically significant for ten datasets at $p<0.05$, while just two corpora see only significant decreases across the residual adapter transfer experiments. Finally, we make our EmoNet framework publicly available for users and developers at https://github.com/EIHW/EmoNet. EmoNet provides an extensive, comprehensively documented command line interface and can be used in a variety of multi-corpus transfer learning settings.
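The significance claims above rest on McNemar's test, which compares two classifiers evaluated on the same test samples by looking only at the samples on which they disagree. The following is a minimal sketch of the exact (binomial) two-sided variant in plain Python. The abstract does not state which variant of the test was used, so the choice of the exact form, the function name `mcnemar_exact`, and the toy labels are all assumptions made for illustration, not the authors' implementation.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test for two classifiers on the same samples.

    Hypothetical helper for illustration; returns (b, c, p_value), where
    b = samples model A gets right and model B gets wrong,
    c = samples model A gets wrong and model B gets right.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, p in triples if a == t and p != t)
    c = sum(1 for t, a, p in triples if a != t and p == t)
    n = b + c
    if n == 0:
        # The models never disagree on correctness: no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis, each discordant sample favours either model
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Toy data (assumed): model A is right on all 10 samples, model B on only 2.
y_true = [0] * 10
pred_a = [0] * 10
pred_b = [1] * 8 + [0] * 2
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value < 0.05)
```

In a multi-corpus setting such as the one described, a test of this kind would be run once per corpus, comparing the shared model's predictions against those of the single-corpus model on that corpus's test partition.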