Speech models have long been known to overfit to individual speakers on many classification tasks. This leads to poor generalization in settings where the speakers are out-of-domain or out-of-distribution, as is common in production environments. We view speaker adaptation as a few-shot learning problem and propose investigating transfer learning approaches inspired by the recent success of pre-trained models in natural language tasks. We propose pre-finetuning speech models on difficult tasks to distill knowledge that transfers to few-shot downstream classification objectives. We pre-finetune Wav2Vec2.0 on every permutation of four multiclass emotional speech recognition corpora and evaluate our pre-finetuned models through 33,600 few-shot fine-tuning trials on the Emotional Speech Dataset.
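A minimal sketch of the two-stage recipe the abstract describes, assuming the Hugging Face transformers interface to Wav2Vec2.0: stage one pre-finetunes the encoder on pooled emotion corpora, stage two re-initializes the classifier head and fine-tunes on a small k-shot slice of the Emotional Speech Dataset. The checkpoint name, label counts, dataset variables, and hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch only: datasets are assumed to yield {"input_values": ..., "labels": ...}
# examples prepared with a Wav2Vec2FeatureExtractor and padded to a fixed length.
from transformers import (
    Wav2Vec2ForSequenceClassification,
    TrainingArguments,
    Trainer,
)


def build_model(num_labels: int, init_from: str = "facebook/wav2vec2-base"):
    """Load a Wav2Vec2.0 encoder with a (possibly fresh) classification head."""
    return Wav2Vec2ForSequenceClassification.from_pretrained(
        init_from,
        num_labels=num_labels,
        # Allows re-initializing the head when the label set changes between stages.
        ignore_mismatched_sizes=True,
    )


def finetune(model, train_dataset, eval_dataset, output_dir, epochs=10):
    """Generic fine-tuning loop shared by the pre-finetuning and few-shot stages."""
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=8,   # assumed value
        num_train_epochs=epochs,         # assumed value
        learning_rate=1e-4,              # assumed value
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()
    trainer.save_model(output_dir)
    return output_dir


# Stage 1: pre-finetune on pooled multiclass emotion corpora (hypothetical datasets).
# stage1_ckpt = finetune(build_model(num_labels=6), pooled_train, pooled_eval,
#                        output_dir="prefinetuned-emotion")

# Stage 2: few-shot fine-tune on a k-shot ESD subset, starting from the
# pre-finetuned checkpoint with a re-initialized head for the ESD label set.
# fewshot_model = build_model(num_labels=5, init_from="prefinetuned-emotion")
# finetune(fewshot_model, esd_kshot_train, esd_eval, output_dir="esd-fewshot")
```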