Pre-trained model representations have demonstrated state-of-the-art performance in speech recognition, natural language processing, and other applications. Models such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden-Unit BERT (HuBERT) enable the generation of lexical and acoustic representations that benefit speech recognition applications. We investigated the use of pre-trained model representations for estimating dimensional emotions, such as activation, valence, and dominance, from speech. We observed that while valence relies heavily on lexical representations, activation and dominance rely mostly on acoustic information. In this work, we used multi-modal fusion of pre-trained model representations to achieve state-of-the-art speech emotion estimation, showing 100% and 30% relative improvements in concordance correlation coefficient (CCC) on valence estimation compared to standard acoustic and lexical baselines, respectively. Finally, we investigated the robustness of pre-trained model representations against degradation from noise and reverberation, and found that lexical and acoustic representations are impacted differently. Lexical representations proved more robust to distortions than acoustic representations, and we demonstrated that knowledge distillation from a multi-modal model helps improve the noise robustness of acoustic-based models.
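The evaluation metric above, the concordance correlation coefficient, measures both correlation and agreement in scale and location between predicted and reference emotion scores. As a minimal sketch (this is the standard definition of CCC, not the authors' training or fusion pipeline):

```python
import numpy as np

def ccc(pred, target):
    """Concordance correlation coefficient:
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2).
    Equals 1 only for perfect agreement; penalizes bias and scale
    mismatch, unlike plain Pearson correlation."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()  # population variance
    cov = np.mean((pred - mu_p) * (target - mu_t))
    return 2.0 * cov / (var_p + var_t + (mu_p - mu_t) ** 2)
```

A "relative improvement in CCC" as reported above compares two such scores, e.g. a fused model's CCC against a single-modality baseline's CCC on the same valence labels.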