Emotions lie on a broad continuum, and treating them as a discrete set of classes limits a model's ability to capture the nuances of that continuum. The challenge is how to describe these nuances and how to enable a model to learn the descriptions. In this work, we devise a method to automatically create a description (or prompt) for a given audio clip by computing acoustic properties such as pitch, loudness, speech rate, and articulation rate. We pair each prompt with its corresponding audio using 5 different emotion datasets, train a neural network model on these audio-text pairs, and then evaluate the model on an additional dataset. We investigate how the model learns to associate audio with the descriptions, resulting in performance improvements on Speech Emotion Recognition and Speech Audio Retrieval. We expect our findings to motivate research on describing the broad continuum of emotion.
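The prompt-creation step described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it estimates two of the named acoustic properties (pitch via a crude autocorrelation peak, loudness via RMS energy) and maps them to descriptive words. The bucket thresholds and prompt wording are hypothetical; speech rate and articulation rate are omitted because they require syllable or phone timings.

```python
import numpy as np

def estimate_pitch(y, sr, fmin=50.0, fmax=500.0):
    """Crude F0 estimate: peak of the autocorrelation within [fmin, fmax].
    A placeholder for a proper pitch tracker (e.g. pYIN)."""
    y = y - np.mean(y)
    corr = np.correlate(y, y, mode="full")[len(y) - 1:]
    lag_min = int(sr / fmax)           # smallest lag = highest pitch
    lag_max = int(sr / fmin)           # largest lag = lowest pitch
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sr / lag

def describe_audio(y, sr,
                   pitch_bins=(150.0, 250.0),   # hypothetical Hz thresholds
                   loud_bins=(0.02, 0.1)):      # hypothetical RMS thresholds
    """Turn raw audio into a natural-language prompt from acoustic properties."""
    f0 = estimate_pitch(y, sr)
    rms = float(np.sqrt(np.mean(y ** 2)))
    pitch_word = ("low" if f0 < pitch_bins[0]
                  else "high" if f0 > pitch_bins[1] else "medium")
    loud_word = ("softly" if rms < loud_bins[0]
                 else "loudly" if rms > loud_bins[1] else "at a normal volume")
    return f"A person speaks with {pitch_word} pitch, {loud_word}."

# Example on a synthetic 220 Hz tone standing in for voiced speech
sr = 16000
y = 0.5 * np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr)
print(describe_audio(y, sr))
```

Each resulting prompt would then be paired with its audio clip to form the audio-text training pairs the abstract describes.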