Speech Emotion Recognition (SER) is the task of recognizing human emotions from natural speech. Done accurately, it can benefit the design of human-centered, context-aware intelligent systems. Existing SER approaches are largely centralized and do not take users' privacy into account. Federated Learning (FL) is a distributed machine learning paradigm that keeps privacy-sensitive personal data decentralized. In this paper, we present a privacy-preserving and data-efficient SER approach based on FL. To the best of our knowledge, this is the first federated SER approach that combines self-training with federated learning to exploit both labeled and unlabeled on-device data. Our experimental evaluations on the IEMOCAP dataset show that our federated approach can learn generalizable SER models even when few data labels are available and the data distributions are highly non-i.i.d. We show that, with as little as 10% labeled data, our approach improves the recognition rate by 8.67% on average compared to its fully-supervised federated counterparts.
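To make the combination of self-training and federated learning concrete, the following is a minimal sketch of the two generic building blocks, not the method evaluated in the paper. The abstract does not state the aggregation rule or how unlabeled samples are selected, so the sketch assumes sample-count-weighted averaging of client parameters (in the style of federated averaging) and confidence-thresholded pseudo-labeling. The function names `pseudo_label` and `fed_avg` and the threshold value are hypothetical.

```python
from typing import List, Sequence, Tuple


def pseudo_label(probs: Sequence[Sequence[float]],
                 threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Self-training step (assumed form): keep only confident predictions.

    probs[i] holds the class probabilities a local model assigns to
    unlabeled sample i. Returns (sample_index, predicted_class) pairs
    whose top probability reaches the threshold.
    """
    selected = []
    for i, p in enumerate(probs):
        best = max(range(len(p)), key=lambda k: p[k])
        if p[best] >= threshold:
            selected.append((i, best))
    return selected


def fed_avg(client_weights: Sequence[Sequence[float]],
            client_sizes: Sequence[int]) -> List[float]:
    """Server step (assumed form): average client parameter vectors,
    weighting each client by its number of local training samples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


if __name__ == "__main__":
    # Hypothetical on-device predictions for three unlabeled utterances.
    print(pseudo_label([[0.95, 0.05], [0.6, 0.4], [0.1, 0.9]]))
    # Two hypothetical clients holding 1 and 3 samples respectively.
    print(fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In a setup of this kind, each device would add its confidently pseudo-labeled samples to its local training set before the next round, and only model parameters, never raw audio, would be sent to the server.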