Researchers have recently started to study how the emotional speech heard by young infants can affect their developmental outcomes. As a part of this research, hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia in the context of the so-called APPLE study. In order to analyze the emotional content of speech in such a massive dataset, an automatic speech emotion recognition (SER) system is required. However, there are no emotion labels or existing in-domain SER systems that could be used for this purpose. In this paper, we introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional SER system for the Finnish subset of the data. We explore the effectiveness of alternative state-of-the-art techniques for deploying an SER system in a new domain, comparing cross-corpus generalization, WGAN-based domain adaptation, and active learning on this task. As a result, we show that the best-performing models achieve a classification performance of 73.4% unweighted average recall (UAR) for valence and 73.2% UAR for arousal in binary classification. The results also show that, of the three approaches, active learning achieves the most consistent performance.