Speech self-supervised models such as wav2vec 2.0 and HuBERT are making revolutionary progress in Automatic Speech Recognition (ASR). However, it has not been fully demonstrated that they improve performance on tasks other than ASR. In this work, we explore partial fine-tuning and entire fine-tuning of the wav2vec 2.0 and HuBERT pre-trained models on three non-ASR speech tasks: Speech Emotion Recognition, Speaker Verification, and Spoken Language Understanding. With simple proposed downstream frameworks, the best scores reach 79.58% weighted accuracy in the speaker-dependent setting and 73.01% weighted accuracy in the speaker-independent setting for Speech Emotion Recognition on IEMOCAP, 2.36% equal error rate for Speaker Verification on VoxCeleb1, and 89.38% accuracy for Intent Classification and 78.92% F1 for Slot Filling on SLURP, demonstrating the strength of fine-tuned wav2vec 2.0 and HuBERT in learning prosodic, voice-print, and semantic representations.
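The contrast between partial and entire fine-tuning can be sketched in PyTorch. Everything below is a hypothetical stand-in, not the paper's implementation: `TinyEncoder` only mimics the overall structure of a wav2vec 2.0 / HuBERT-style model (a convolutional feature extractor followed by transformer layers), and the mean-pooling classification head is an assumed minimal downstream framework. In the partial setting, the feature extractor's parameters are frozen and only the transformer layers and the head are updated; in the entire setting, all parameters are trained.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pre-trained speech encoder
    (CNN feature extractor + transformer layers, as in wav2vec 2.0)."""
    def __init__(self, dim=32):
        super().__init__()
        self.feature_extractor = nn.Conv1d(1, dim, kernel_size=10, stride=5)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, wav):                              # wav: (batch, samples)
        feats = self.feature_extractor(wav.unsqueeze(1))  # (batch, dim, frames)
        return self.encoder(feats.transpose(1, 2))        # (batch, frames, dim)

class DownstreamClassifier(nn.Module):
    """Assumed minimal downstream framework: mean-pool over time, linear head."""
    def __init__(self, encoder, dim=32, num_classes=4, partial=True):
        super().__init__()
        self.encoder = encoder
        if partial:
            # "Partial fine-tuning": freeze the CNN feature extractor,
            # train only the transformer layers and the task head.
            for p in self.encoder.feature_extractor.parameters():
                p.requires_grad = False
        self.head = nn.Linear(dim, num_classes)

    def forward(self, wav):
        h = self.encoder(wav)             # (batch, frames, dim)
        return self.head(h.mean(dim=1))   # utterance-level prediction

model = DownstreamClassifier(TinyEncoder(), partial=True)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(trainable < total)  # frozen feature extractor => fewer trainable params
logits = model(torch.randn(2, 1000))
print(logits.shape)  # torch.Size([2, 4])
```

Setting `partial=False` recovers entire fine-tuning, where the optimizer updates every parameter of the pre-trained encoder along with the task head.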