Human speech can be characterized by different components, including semantic content, speaker identity, and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks, respectively. However, extracting prosodic information remains an open and challenging research question, both because of the intrinsic association between attributes such as timbre and rhythm, and because of the need for unsupervised training schemes to achieve robust, large-scale, speaker-independent ASR. This paper addresses the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement, and integrate three crucial components in our proposed speech reconstruction model, Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units representing semantic content, (2) a pretrained speaker verification model that generates speaker identity embeddings, and (3) a trainable prosody encoder that learns prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective and subjective evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary to, and beneficial for, widely used speech pretraining models, and surpass state-of-the-art methods when Prosody2Vec is combined with HuBERT representations. Audio samples are available on our demo website.
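To make the three-component design concrete, the following is a minimal PyTorch sketch of the reconstruction setup described above: discrete content units, a frozen speaker embedding, and a trainable prosody encoder are fused and decoded back into acoustic features. All module shapes, dimensions, and the decoder choice are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of the Prosody2Vec reconstruction model.
# Dimensions and architectures are assumptions for illustration only.
import torch
import torch.nn as nn


class ProsodyEncoder(nn.Module):
    """Trainable encoder mapping a mel-spectrogram to prosody embeddings."""

    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, dim, kernel_size=5, padding=2)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, mel):  # mel: (batch, frames, n_mels)
        h = self.conv(mel.transpose(1, 2)).transpose(1, 2)
        out, _ = self.gru(torch.relu(h))
        return out  # (batch, frames, dim)


class Prosody2Vec(nn.Module):
    """Reconstructs speech features from (1) discrete content units,
    (2) a fixed speaker-verification embedding, and (3) learned prosody
    representations, so that prosody is pushed into its own encoder."""

    def __init__(self, n_units=100, unit_dim=256, spk_dim=192,
                 pros_dim=256, n_mels=80):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, unit_dim)  # content units
        self.prosody_enc = ProsodyEncoder(n_mels, pros_dim)
        self.decoder = nn.GRU(unit_dim + spk_dim + pros_dim, 512,
                              batch_first=True)
        self.out = nn.Linear(512, n_mels)

    def forward(self, units, spk_emb, mel):
        # units: (batch, frames) int64 from a unit encoder;
        # spk_emb: (batch, spk_dim) from a frozen speaker model;
        # mel: (batch, frames, n_mels) target features.
        c = self.unit_emb(units)
        p = self.prosody_enc(mel)
        s = spk_emb.unsqueeze(1).expand(-1, c.size(1), -1)
        h, _ = self.decoder(torch.cat([c, s, p], dim=-1))
        return self.out(h)  # reconstructed mel-spectrogram


# Unsupervised pretraining step: reconstruct the input features.
model = Prosody2Vec()
units = torch.randint(0, 100, (2, 120))
spk = torch.randn(2, 192)
mel = torch.randn(2, 120, 80)
loss = nn.functional.l1_loss(model(units, spk, mel), mel)
```

Because the content and speaker branches are fixed or discrete, the reconstruction loss can only be lowered through the prosody encoder, which is what encourages it to capture the prosodic residual; after pretraining, its outputs can be fine-tuned for SER or swapped between utterances for EVC.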