Self-supervised pre-trained features have consistently delivered state-of-the-art (SOTA) results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use and integration of a large variety of self-supervised features. Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed. These experiments investigate the interactions among fine-tuning of self-supervised feature models, aggregation of frame-level features into utterance-level features, and back-end classification networks. The proposed monomodal, speech-only system not only achieves SOTA results, but also shows that powerful, well fine-tuned self-supervised acoustic features can reach results comparable to those achieved by SOTA multimodal systems that use both speech and text modalities.
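As a minimal illustration of the Upstream + Downstream paradigm described above, the sketch below shows one plausible instantiation, not the authors' exact system: a pre-trained wav2vec 2.0 model (an assumed checkpoint from HuggingFace Transformers) serves as the upstream self-supervised feature extractor, frame-level features are aggregated into an utterance-level vector by mean pooling, and a small downstream head performs the classification. The checkpoint name, pooling choice, classifier shape, and the 4-class IEMOCAP setup are all illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # upstream self-supervised model


class UpstreamDownstreamSER(nn.Module):
    """Minimal Upstream + Downstream SER sketch (illustrative, not the paper's system)."""

    def __init__(self, num_classes: int = 4, freeze_upstream: bool = True):
        super().__init__()
        # Upstream: pre-trained self-supervised feature extractor (assumed checkpoint).
        self.upstream = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        if freeze_upstream:
            # Without fine-tuning, the upstream acts as a fixed feature extractor;
            # unfreezing it corresponds to the fine-tuning condition in the experiments.
            for p in self.upstream.parameters():
                p.requires_grad = False
        # Downstream: a small classification head over utterance-level features.
        hidden = self.upstream.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, num_classes)
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw 16 kHz audio.
        frames = self.upstream(waveform).last_hidden_state  # (batch, time, hidden)
        # Aggregation: mean-pool frame-level features into one utterance-level vector.
        utterance = frames.mean(dim=1)                      # (batch, hidden)
        return self.classifier(utterance)                   # (batch, num_classes) logits


# Usage: classify a batch of two 1-second utterances into 4 emotion classes
# (num_classes=4 assumes the common 4-class IEMOCAP evaluation setup).
model = UpstreamDownstreamSER()
logits = model(torch.randn(2, 16000))
print(logits.shape)  # torch.Size([2, 4])
```

Swapping the upstream model is a one-line change, which is the modularity the Upstream + Downstream split is meant to provide; other aggregation strategies (e.g., attentive pooling) plug in at the mean-pooling step.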