Predicting valence from speech is an important but challenging problem. The externalization of valence in speech relies on speaker-dependent cues, which contributes to performance that is often significantly lower than for other emotional attributes such as arousal and dominance. A practical way to improve valence prediction from speech is to adapt the models to the target speakers in the test set. Adapting a speech emotion recognition (SER) system to a particular speaker is hard, especially with deep neural networks (DNNs), since it requires optimizing millions of parameters. This study proposes an unsupervised approach to this problem: it searches the training set for speakers whose acoustic patterns are similar to those of the speaker in the test set. Speech samples from the selected speakers are used to create the adaptation set. The approach leverages transfer learning, starting from pre-trained models that are then adapted with these speech samples. We propose three alternative adaptation strategies: the unique speaker, oversampling, and weighting approaches. These methods differ in how they use the adaptation set to personalize the valence models. The results demonstrate that a valence prediction model can be efficiently personalized with these unsupervised approaches, leading to relative improvements as high as 13.52%.
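The speaker-selection step can be illustrated with a minimal sketch. The abstract does not state how acoustic similarity is measured, so everything below is an assumption made for illustration: each speaker is summarized by the mean of hypothetical per-utterance feature vectors, similarity is cosine similarity, and the names `select_similar_speakers`, `build_adaptation_set`, and the parameter `k` are invented here. No emotion labels from the test speaker are used, which is what keeps the selection unsupervised.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Average a list of equal-length feature vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_similar_speakers(
    train_feats: Dict[str, List[Vector]],
    test_feats: List[Vector],
    k: int = 2,
) -> List[str]:
    """Return the k training speakers closest to the test speaker.

    Assumption: similarity is the cosine between mean feature vectors.
    Only acoustic features are used, never emotion labels.
    """
    target = mean_vector(test_feats)
    scores = {
        speaker: cosine_similarity(mean_vector(feats), target)
        for speaker, feats in train_feats.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def build_adaptation_set(
    train_feats: Dict[str, List[Vector]],
    selected: List[str],
) -> List[Tuple[str, Vector]]:
    """Collect every utterance of the selected speakers as (speaker, features)."""
    return [(speaker, utt) for speaker in selected for utt in train_feats[speaker]]


# Toy data: three training speakers and one test speaker (hypothetical values).
train = {
    "spk_a": [[1.0, 0.0], [0.9, 0.1]],
    "spk_b": [[0.0, 1.0]],
    "spk_c": [[0.8, 0.3]],
}
test_speaker = [[1.0, 0.05]]

selected = select_similar_speakers(train, test_speaker, k=2)
adaptation_set = build_adaptation_set(train, selected)
```

The resulting adaptation set would then be used to fine-tune a pre-trained valence model. How the unique speaker, oversampling, and weighting strategies each use that set is not detailed in the abstract, so the sketch stops at set construction.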